Encrypting genomic data for storage and genomic computations

ABSTRACT

Genomic data encryption embodiments are presented which generally maintain the privacy of genomic data via an encryption scheme which allows computations to be performed on the encrypted data without the need for decryption. The genomic data is encrypted using a homomorphic polynomial encryption scheme to produce a vector of ciphertexts, where each ciphertext represents a different sample of the genomic data and takes the form of a polynomial and its associated coefficients. Computations on the encrypted genomic data are then performed on the vector or vectors of ciphertexts without decrypting the data. The results of the computations are then provided to an end user who decrypts them.

BACKGROUND

The development of cloud storage and services has allowed users to offload both storage of their data and associated computations on that data. As a result, businesses can choose to forego the expensive proposition of maintaining their own data centers, relying instead on cloud storage and computational services.

One type of data amenable to cloud storage and computational services is genomic data. The field of genomics involves analyzing the function and structure of genomes. This includes DNA sequencing and genetic mapping, as well as the study of interactions between loci and alleles within the genome. Human genomic data can be mined to identify variants in genes that can contribute to diseases. However, a large and diverse genomic data set is needed to identify these genetic links. To this end, large databases of genomic data are being established.

SUMMARY

Genomic data encryption embodiments described herein generally maintain the privacy of genomic data via an encryption scheme which allows computations to be performed on the encrypted data without the need for decryption.

In one embodiment, the genomic data is first encoded as polynomials in a message space of a homomorphic encryption scheme. Then the encoded genomic data is encrypted using the homomorphic polynomial encryption scheme to produce a vector of ciphertexts, where each ciphertext represents a different sample of the genomic data and takes the form of a polynomial and its associated coefficients. The aforementioned computations on the encrypted genomic data are then performed on the vector or vectors of ciphertexts without decrypting the data.

With regard to the aforementioned encoding of the genomic data, in one embodiment this involves, for each real number making up the data, encoding the number by generating a bit decomposition of the number, converting the bit decomposition to a truncated bit decomposition $\vec{\alpha}=(\alpha_{k},\ldots,\alpha_{0},\alpha_{-1},\ldots,\alpha_{-u})$ based on the desired precision u, and then encoding the real number as a polynomial using

$e_{\alpha}(x)\overset{def}{=}\sum\limits_{i=0}^{k+u}\beta_{i}x^{i}\in R_{2},$

where $\beta_{i}=\alpha_{i-u}$ and k+u is the total number of bits in the truncated bit decomposition. The encoded real number is then encrypted using an appropriate homomorphic encryption scheme. Computations can then be performed on the encrypted real number data without decryption to produce an encrypted result. In one embodiment, this involves computing on the encrypted real number data using an equation in the form of $G(e_{\alpha})=\sum_{i=0}^{D}a_{i}\cdot x^{(D-i)u}\cdot e_{\alpha}^{i}$, where D is the degree of the polynomial and the $a_{i}$'s are prescribed coefficients. The encrypted result can be decrypted using the appropriate homomorphic decryption. Then, the decrypted results are transformed using

$F(\tilde{\alpha})=\frac{G(e_{\alpha})\ \left(\mathrm{mod}\ (x-2)\right)}{2^{Du}},$

and evaluated at x=2 to obtain $\tilde{\alpha}$, which represents the truncated real number representing the results.
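The encoding and decoding steps above can be illustrated on plaintext values. The following is a minimal sketch under stated assumptions: it performs no encryption, it keeps polynomial coefficients as ordinary integers rather than reducing them in a message space, it assumes a non-negative input smaller than 2^(k+1), and the function names (`encode_real`, `evaluate_G`, `decode_result`) are hypothetical names chosen for illustration.

```python
from math import floor

def encode_real(alpha, k, u):
    """Return [beta_0, ..., beta_{k+u}] with beta_i = alpha_{i-u}.

    Assumes 0 <= alpha < 2**(k+1); bits below 2**-u are truncated.
    """
    scaled = floor(alpha * 2**u)               # truncate to u fractional bits
    return [(scaled >> i) & 1 for i in range(k + u + 1)]

def poly_add(a, b):
    """Add two integer polynomials given as coefficient lists (lowest degree first)."""
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            for i in range(n)]

def poly_mul(a, b):
    """Multiply two integer polynomials (schoolbook convolution)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def evaluate_G(e_alpha, coeffs, u):
    """G(e) = sum_i a_i * x^((D-i)u) * e^i, with coeffs = [a_0, ..., a_D]."""
    D = len(coeffs) - 1
    total, power = [0], [1]                    # power holds e_alpha**i
    for i, a_i in enumerate(coeffs):
        # Multiplying by x^((D-i)u) shifts the coefficient list.
        shifted = [0] * ((D - i) * u) + [a_i * c for c in power]
        total = poly_add(total, shifted)
        power = poly_mul(power, e_alpha)
    return total

def decode_result(G, D, u):
    """Evaluate G at x = 2 (reduction mod (x - 2)) and divide by 2^(D*u)."""
    return sum(c * 2**i for i, c in enumerate(G)) / 2**(D * u)

# Example: alpha = 2.75 with k = 2 integer bit positions and u = 2 fractional
# bits, evaluating the polynomial 1 + 3*alpha^2 (coefficients a_0=1, a_1=0, a_2=3).
e = encode_real(2.75, k=2, u=2)
result = decode_result(evaluate_G(e, [1, 0, 3], u=2), D=2, u=2)
```

The factor x^((D−i)u) aligns every term to the same scale 2^(Du), which is why a single division at the end recovers the result.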

The foregoing encrypting, computations and decrypting can be accomplished by separate entities. For example, a user can encrypt the data and transmit it for storage in the cloud. In addition, while in the cloud, computations can be performed on the encrypted data, without decryption. The results of these computations can then be provided to an end user (who can be the encrypting user) in encrypted form. The end user then decrypts the results.

It should also be noted that this Summary is provided to introduce a selection of concepts, in a simplified form, that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

DESCRIPTION OF THE DRAWINGS

The specific features, aspects, and advantages of the disclosure will become better understood with regard to the following description, appended claims, and accompanying drawings where:

FIG. 1 is a flow diagram generally outlining one embodiment of a process for genomic data encryption.

FIG. 2 depicts a simplified diagram of a computing environment for receiving encoded and encrypted genomic data from a user, storing the encrypted data, performing computations on the data without decrypting it first, and providing the results of the computations in the same encrypted form to the user.

FIG. 3 is a flow diagram generally outlining one embodiment of a process for performing genomic computations on encoded and encrypted genomic data without decrypting the data.

FIG. 4 is a flow diagram generally outlining an implementation of the part of the process of FIG. 3 involving the computation of genotype frequency for a single locus.

FIG. 5 is a flow diagram generally outlining an implementation of the part of the process of FIG. 3 involving the computation of genotype pair frequency for two loci.

FIG. 6 is a flow diagram generally outlining an implementation of the part of the process of FIG. 3 involving the computation of genotype/phenotype frequency for a single locus.

FIG. 7 is an exemplary contingency table of 3 genotypes vs. case/controls.

FIG. 8 is a diagram depicting a general purpose computing device constituting an exemplary system for implementing genomic data encryption embodiments described herein.

FIG. 9 is a flow diagram generally outlining one embodiment of a process for encrypting real numbers using a homomorphic polynomial encryption scheme.

FIG. 10 is a flow diagram generally outlining one embodiment of a process for converting encrypted real numbers to a more noise resistant polynomial form before performing computations on the encrypted numbers, all without first decrypting the numbers.

FIG. 11 is a flow diagram generally outlining one embodiment of a process for decrypting the results of computations on encrypted real numbers.

DETAILED DESCRIPTION

In the following description of genomic data encryption embodiments reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific embodiments in which the technique may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the technique.

It is also noted that for the sake of clarity specific terminology will be resorted to in describing the genomic data encryption embodiments described herein and it is not intended for these embodiments to be limited to the specific terms so chosen. Furthermore, it is to be understood that each specific term includes all its technical equivalents that operate in a broadly similar manner to achieve a similar purpose. Reference herein to “one embodiment”, or “another embodiment”, or an “exemplary embodiment”, or an “alternate embodiment”, or “one implementation”, or “another implementation”, or an “exemplary implementation”, or an “alternate implementation” means that a particular feature, a particular structure, or particular characteristics described in connection with the embodiment or implementation can be included in at least one embodiment of the genomic data encryption. The appearances of the phrases “in one embodiment”, “in another embodiment”, “in an exemplary embodiment”, “in an alternate embodiment”, “in one implementation”, “in another implementation”, “in an exemplary implementation”, “in an alternate implementation” in various places in the specification are not necessarily all referring to the same embodiment or implementation, nor are separate or alternative embodiments/implementations mutually exclusive of other embodiments/implementations. Yet further, the order of process flow representing one or more embodiments or implementations of the genomic data encryption does not inherently indicate any particular order nor imply any limitations.

1.0 Genomic Data Encryption

In genetics, a locus is the specific location of a gene or DNA sequence or position on a chromosome. A variant of the similar DNA sequence located at a given locus is called an allele. A single-nucleotide polymorphism (SNP) is a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes in a human. In this case it is said that there are two alleles. Chromosomes having the same allele of a given gene at some locus are called homozygous with respect to that gene, while those that have different alleles of a given gene at a locus are called heterozygous with respect to that gene.

It is advantageous to protect the privacy of individuals who donate their DNA to research, or patients undergoing genomic studies. This is particularly important for cloud storage of the genomic data and cloud-based computations performed using the stored data. These cloud-based computations take individuals' genomic data as input, which potentially compromises patient privacy, even in de-identified data sets. This is because it is possible to find the identity of a person using genomic data and publicly available records. Accordingly, many cloud storage solutions employ encryption on the user's data to preserve data privacy. Unfortunately, it is difficult to efficiently perform meaningful computations on encrypted data without decrypting the data first. The genomic data encryption embodiments described herein protect the privacy of individuals' genomic data while allowing these computational studies to be conducted without decryption.

More particularly, the genomic data encryption embodiments described herein employ homomorphic encryption to encrypt encoded genomic data and then to compute on the encrypted data. To this end, referring to FIG. 1, in one embodiment genomic data is first encoded as polynomials in a message space of a homomorphic encryption scheme (process action 100). The encoded genomic data is then encrypted using the homomorphic polynomial encryption scheme to produce a vector of ciphertexts (process action 102). Generally the term “ciphertext” refers to an encrypted data set (e.g., an encrypted message, an encrypted data bit, encrypted text, etc.). In the context of the genomic data encryption embodiments described herein, the ciphertext refers to encrypted genomic data.

It is noted that each ciphertext in a ciphertext vector represents a different sample of the genomic data (e.g., a discrete sample of the genotype or genotype/phenotype pairing associated with a single-nucleotide polymorphism (SNP)) and takes the form of a polynomial and its associated coefficients. It is further noted that multiple ciphertext vectors can be generated, each of which could represent genomic data associated with a different locus. Once the genomic data is encoded and encrypted, it can be transmitted for storage and genomic computations (process action 104). This transmission can be via a computer network (such as the Internet).

With regard to the foregoing action of encrypting the encoded genomic data using a homomorphic polynomial encryption scheme, generally any appropriate homomorphic polynomial encryption method can be employed for this purpose. For example, in one embodiment, a somewhat homomorphic encryption (SwHE) scheme is employed. This SwHE scheme, represented by the expression SwHE=(SH.Keygen, SH.Enc, SH.Add, SH.Mult, SH.Dec), is associated with a number of parameters:

- the dimension n, which is a power of 2;
- the cyclotomic polynomial $f(x)=x^{n}+1$;
- the modulus q, which is a prime number such that $q\equiv 1 \pmod{2n}$ (together, n, q, and $f(x)$ define rings $R=Z[x]/\langle f(x)\rangle$ and $R_{q}=R/qR$);
- the error parameter σ, which defines a discrete Gaussian error distribution $\chi=D_{Z^{n},\sigma}$ with a standard deviation σ;
- a prime number t<q, which defines the message space of the scheme as $R_{t}=R/tR$, the ring of integer polynomials modulo $f(x)$ and t; and
- a number D>0, which defines a bound on the maximum number of multiplications that can be performed correctly using the scheme.

In one embodiment, the SwHE scheme is a function of the following component operations:

- SH.Keygen($1^{K}$): a key generation operation, which in one implementation includes (1) sampling a ring element $s\leftarrow\chi$, (2) defining a secret key sk=s, (3) sampling a uniformly random ring element $a_{1}\leftarrow R_{q}$ and an error $e\leftarrow\chi$, and (4) computing a public key $pk=(a_{0}=-(a_{1}s+te),\ a_{1})$;
- SH.Enc(pk, m): an encryption operation, which in one implementation includes (1) encoding the message m as a degree n polynomial with coefficients in $Z_{t}$ and, given the public key $pk=(a_{0},a_{1})$ and a message $m\in R_{q}$, sampling $u\leftarrow\chi$ and $f,g\leftarrow\chi$, and (2) computing the ciphertext $ct=(c_{0},c_{1})=(a_{0}u+tg+m,\ a_{1}u+tf)$; and
- SH.Dec(sk, $ct=(c_{0},c_{1},\ldots,c_{\delta})$): a decryption operation, which in one implementation includes (1) decrypting by computing

$\tilde{m}=\sum\limits_{i=0}^{\delta}c_{i}s^{i}\in R_{q},$

and (2) outputting the message as $\tilde{m} \pmod{t}$.
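The key generation, encryption, and decryption operations listed above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the scheme's actual implementation: the parameters n=16, q=65537, t=17 are far too small to be secure, the error distribution is approximated by a rounded and clipped Gaussian (the clipping bound is an assumption added so that decryption is guaranteed correct at these sizes), Python's `random` module is not cryptographically secure, and all function names are hypothetical.

```python
import random

N, Q, T = 16, 65537, 17    # toy n (power of 2), prime q = 1 mod 2n, prime t < q
SIGMA, BOUND = 1.5, 4      # assumed error width and clipping bound

def poly_mul(a, b):
    """Multiply in Z_q[x]/(x^n + 1): x^n wraps around to -1."""
    res = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                res[k] = (res[k] + ai * bj) % Q
            else:
                res[k - N] = (res[k - N] - ai * bj) % Q
    return res

def poly_add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def scale(a, c):
    return [(c * x) % Q for x in a]

def sample_error(rng):
    """Rounded Gaussian, clipped to [-BOUND, BOUND] (a simplification)."""
    return [max(-BOUND, min(BOUND, round(rng.gauss(0, SIGMA)))) for _ in range(N)]

def keygen(rng):
    s = sample_error(rng)                         # secret key sk = s
    a1 = [rng.randrange(Q) for _ in range(N)]     # uniform ring element
    e = sample_error(rng)
    a0 = scale(poly_add(poly_mul(a1, s), scale(e, T)), -1)   # -(a1*s + t*e)
    return s, (a0, a1)

def encrypt(pk, m, rng):
    a0, a1 = pk
    u, f, g = sample_error(rng), sample_error(rng), sample_error(rng)
    c0 = poly_add(poly_add(poly_mul(a0, u), scale(g, T)), m)  # a0*u + t*g + m
    c1 = poly_add(poly_mul(a1, u), scale(f, T))               # a1*u + t*f
    return [c0, c1]

def decrypt(sk, ct):
    acc, s_pow = [0] * N, [1] + [0] * (N - 1)
    for c in ct:                                  # sum_i c_i * s^i in R_q
        acc = poly_add(acc, poly_mul(c, s_pow))
        s_pow = poly_mul(s_pow, sk)
    centered = [x - Q if x > Q // 2 else x for x in acc]   # lift to (-q/2, q/2]
    return [x % T for x in centered]              # reduce mod t

rng = random.Random(1)
sk, pk = keygen(rng)
message = [(3 * i) % T for i in range(N)]
recovered = decrypt(sk, encrypt(pk, message, rng))
```

Decryption works because c₀ + c₁s = m + t(g + ƒs − eu): the a₁us terms cancel, and the remaining multiple of t disappears in the final reduction as long as it stays below q/2 in absolute value.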

In addition, the SwHE scheme includes homomorphic operations SH.Add and SH.Mult. In one embodiment, in order to homomorphically compute an arbitrary function ƒ, an arithmetic circuit for ƒ (made of addition and multiplication operations over $Z_{t}$) may be constructed. The SH.Add and SH.Mult operations are used to iteratively compute ƒ on encrypted inputs. Although the ciphertexts produced by SH.Enc contain two ring elements, the homomorphic operations increase the number of ring elements in the ciphertext. In general, the SH.Add and the SH.Mult operations get as input two ciphertexts $ct=(c_{0},c_{1},\ldots,c_{\delta})$ and $ct'=(c_{0}',c_{1}',\ldots,c_{\gamma}')$. The output of SH.Add contains max(δ+1, γ+1) ring elements, whereas the output of SH.Mult contains δ+γ+1 ring elements.

- SH.Add(pk, ct₀, ct₁): Let $ct=(c_{0},c_{1},\ldots,c_{\delta})$ and $ct'=(c_{0}',c_{1}',\ldots,c_{\gamma}')$ be two ciphertexts. Assume that δ=γ; otherwise, pad the shorter ciphertext with zeroes. Homomorphic addition is accomplished by component-wise addition of the ciphertexts. Namely, compute and output

$ct_{add}=(c_{0}+c_{0}',\ c_{1}+c_{1}',\ \ldots,\ c_{\max(\delta,\gamma)}+c_{\max(\delta,\gamma)}')\in R_{q}^{\max(\delta,\gamma)+1}$

- SH.Mult(pk, ct₀, ct₁): Let $ct=(c_{0},c_{1},\ldots,c_{\delta})$ and $ct'=(c_{0}',c_{1}',\ldots,c_{\gamma}')$ be two ciphertexts. Let v be a symbolic variable and consider the expression

$\left(\sum\limits_{i=0}^{\delta}c_{i}v^{i}\right)\cdot\left(\sum\limits_{i=0}^{\gamma}c_{i}'v^{i}\right)\ \text{over}\ R_{q}. \qquad (1)$

Eq. (1) can be decomposed by symbolically treating v as an unknown variable to compute $\hat{c}_{0},\ldots,\hat{c}_{\delta+\gamma}\in R_{q}$ such that for all $v\in R_{q}$

$\left(\sum\limits_{i=0}^{\delta}c_{i}v^{i}\right)\cdot\left(\sum\limits_{i=0}^{\gamma}c_{i}'v^{i}\right)\equiv\sum\limits_{i=0}^{\delta+\gamma}\hat{c}_{i}v^{i}. \qquad (2)$

The output ciphertext is $ct_{mult}=(\hat{c}_{0},\ldots,\hat{c}_{\delta+\gamma})$.
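The structure of SH.Add and SH.Mult can be checked without any encryption by treating a ciphertext as a list of ring elements, that is, as the coefficients of a polynomial in the symbolic variable v. The sketch below is an illustration only: the toy ring Z₂₅₇[x]/(x⁸+1) and the helper names are assumptions, and the random "ciphertexts" are not real encryptions. It verifies the identity in Eq. (2) at a random point v, and the element counts max(δ+1, γ+1) and δ+γ+1.

```python
import random

N, Q = 8, 257   # toy ring Z_Q[x]/(x^N + 1); values chosen for illustration

def ring_add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def ring_mul(a, b):
    """Negacyclic convolution: x^N is replaced by -1."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + ai * bj) % Q
            else:
                out[k - N] = (out[k - N] - ai * bj) % Q
    return out

def sh_add(ct, ct2):
    """Component-wise addition after zero-padding the shorter ciphertext."""
    size = max(len(ct), len(ct2))
    zero = [0] * N
    a = ct + [zero] * (size - len(ct))
    b = ct2 + [zero] * (size - len(ct2))
    return [ring_add(x, y) for x, y in zip(a, b)]

def sh_mult(ct, ct2):
    """Multiply the two ciphertexts as polynomials in the symbolic variable v."""
    out = [[0] * N for _ in range(len(ct) + len(ct2) - 1)]
    for i, ci in enumerate(ct):
        for j, cj in enumerate(ct2):
            out[i + j] = ring_add(out[i + j], ring_mul(ci, cj))
    return out

def eval_at(ct, v):
    """Compute sum_i c_i * v^i in the ring (decryption uses v = s)."""
    acc, v_pow = [0] * N, [1] + [0] * (N - 1)
    for c in ct:
        acc = ring_add(acc, ring_mul(c, v_pow))
        v_pow = ring_mul(v_pow, v)
    return acc

def random_ct(size, rng):
    """Stand-in for a ciphertext: `size` uniformly random ring elements."""
    return [[rng.randrange(Q) for _ in range(N)] for _ in range(size)]

rng = random.Random(0)
ct, ct2 = random_ct(2, rng), random_ct(3, rng)      # delta = 1, gamma = 2
v = [rng.randrange(Q) for _ in range(N)]
product_ok = eval_at(sh_mult(ct, ct2), v) == ring_mul(eval_at(ct, v), eval_at(ct2, v))
```

Because decryption evaluates the ciphertext at v=s, this identity is what makes the product ciphertext decrypt to the product of the two underlying values.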

With regard to the foregoing action of transmitting the encoded and encrypted genomic data for storage and genomic computations, this can involve sending the data to cloud storage and performing the computations in the cloud as well. In this context, being “in the cloud” refers to computing concepts involving a large number of computers connected through a real-time computer network (such as the Internet). The encoded and encrypted data can be stored on one or more of these computers, and the aforementioned computations can be performed on one or more of these computers as well (either the same computer or computers, a different computer or computers, or any combination thereof).

FIG. 2 illustrates an example of the foregoing computing environment 200 for receiving encoded and encrypted genomic data from a user, storing the encrypted data, performing computations on the data without decrypting it first, and providing the results of the computations in the same encrypted form to the user. The user, as represented in FIG. 2 by the user computer 202 (which is in communication with a computer network such as the Internet), encodes and encrypts the genomic data as described previously. This encrypted data 204 is transmitted via the computer network to a cloud storage and computation framework 206 (which can be one or more computers also in communication with the computer network). The encrypted genomic data 204 is received by one or more storage devices 208 residing within the cloud storage and computation framework 206. The aforementioned genomic computations are then performed on the encrypted data 204. In the depicted example, one or more computation devices 210 residing within the cloud storage and computation framework 206 obtain the stored encrypted data 204 from the storage device or devices 208 and perform the genomic computations without decrypting the data. The computation device or devices 210 then transmit the results of the genomic computations 212 to an end user. In the depicted example, the end user is the user computer 202 (although this need not be the case). It is noted that the results exhibit the same encryption as the genomic data. It is further noted that in other embodiments, the storage device or devices could also perform the computations and so act as the one or more computation devices.

Referring now to FIG. 3, an exemplary process for performing the aforementioned genomic computations is outlined. First, one or more computers (e.g., in the cloud) receive the encoded and encrypted genomic data (process action 300). As indicated previously, this data includes at least one vector of ciphertexts, each ciphertext of which represents a different sample of the genomic data and takes the form of a polynomial and its associated coefficients. One or more computations are performed on the vector or vectors of ciphertexts without having to decrypt and decode the underlying genomic data (process action 302). These genomic computations correspond to statistical analyses that have been developed by computational biologists and statisticians to conduct genomic correlation studies on genomic data in populations. For example, as will be described in greater detail later, some of the genomic computations possible include the Pearson Goodness-of-Fit (Chi-Squared) Test to measure data quality, the Cochran-Armitage Test for trends on the correlations between genotypes and phenotypes, the Estimation Maximization Algorithm for Haplotyping to estimate haplotype frequencies from genotype counts, and the Linkage Disequilibrium statistic to estimate correlation between genes. These types of genomic computations involve multiplication and addition of the ciphertext vector(s).

Once the foregoing genomic calculations are complete, the results are provided to an end user (process action 304). It is noted that the end user can be the same entity that encoded and encrypted the genomic data in the first place (as shown in FIG. 2), or a different party that is authorized by the encoding and encrypting party to decrypt and decode the results. The results can be provided via a transmission over a computer network (such as the Internet).

1.1 Encoding and Encrypting Genotypes

For a plaintext m, the encryption of m is denoted as $\hat{m}$. For a single SNP, there are 3 possible genotypes which are represented as 0, 1, 2, with 1 being the heterozygous genotype and 0, 2 being the homozygous genotypes. Additionally, there may be a missing genotype which is represented as −9.

The genotype numbers are encoded as elements in the ring $R_{q}\overset{def}{=}R/qR$, where $R\overset{def}{=}Z[x]/\langle x^{n}+1\rangle$ (as described previously), and this encoding can be arbitrarily chosen as $E_{g}:\{0,1,2,-9\}\rightarrow R_{q}$. To ensure that the coefficients of the encoding (in $R_{q}$) are small, the following further encoding is employed:

$E_{g}(z)\overset{def}{=}\begin{cases}-1 & z=0\\ 0 & z=1\\ 1 & z=2\\ x^{\tau} & z=-9\end{cases}\quad\text{where}\ \tau\overset{def}{=}\log t \qquad (3)$

Note that $t=2^{\tau}$ is defined to be a power of 2.

Next, an indicator function g_(i) is defined by:

$\begin{matrix}{{g_{i}(z)} = \left\{ \begin{matrix}1 & {z = {E_{g}(i)}} \\0 & {z \neq {E_{g}(i)}}\end{matrix} \right.} & (4)\end{matrix}$

The input is a vector $\vec{z}\overset{def}{=}(z_{1},\ldots,z_{k})^{T}\in\{-1,0,1,x^{\tau}\}^{k}$ of genotype samples. Let $(2^{-1})_{q}$ denote the inverse of 2 modulo q. Lagrange interpolation can be employed to find the polynomial computing each $g_{i}$. In one implementation, the following polynomials are computed (over $R_{q}$):

$g_{0}(z)\overset{def}{=}(2^{-1})_{q}\cdot(z^{2}-z),\quad g_{1}(z)\overset{def}{=}(1-z^{2})(1+z\cdot x^{n-\tau}),\quad g_{2}(z)\overset{def}{=}(2^{-1})_{q}\cdot(z^{2}+z) \qquad (5)$

As a sanity check, note that for $z\in\{-1,0,1\}$, $g_{i}(z)=1$ if $z=E_{g}(i)$ and $g_{i}(z)=0$ if $z\neq E_{g}(i)$. When $z=x^{\tau}$, the encoding of the missing genotype −9, it is desired that $g_{i}(z)=0$ always. This is not achieved exactly, but instead the functionally equivalent values are:

$g_{0}(x^{\tau})=(2^{-1})_{q}\cdot(x^{2\tau}-x^{\tau}),\quad g_{1}(x^{\tau})=(1-x^{2\tau})(1+x^{n})=0,\quad g_{2}(x^{\tau})=(2^{-1})_{q}\cdot(x^{2\tau}+x^{\tau}) \qquad (6)$

Note that $g_{1}(x^{\tau})=0$ because $1+x^{n}=0$ in $R_{q}$. The case for $g_{0}$ and $g_{2}$ is different, but functionally equivalent because at decoding these polynomials will evaluate to 0. More particularly, to decode a polynomial p(x), p(x) is evaluated at x=2 and the result is reduced mod t; since $t=2^{\tau}$, the monomial $x^{\tau}$ evaluates to t, which is 0 mod t. Accordingly, the encoding for a missing genotype, $x^{\tau}$, decodes to 0 but its encoding is non-zero, guaranteeing that all encodings are distinct.
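The encoding of Eq. (3) and the decoding rule just described can be illustrated on plaintext polynomials. This is a minimal sketch assuming a toy value τ=3 (so t=8) and hypothetical function names; polynomials are plain integer coefficient lists, lowest degree first, and no encryption is involved.

```python
TAU = 3
T = 2 ** TAU          # plaintext modulus t = 2^tau (toy value, assumed)

def encode_genotype(z):
    """E_g from Eq. (3), returned as a coefficient list (lowest degree first)."""
    if z == -9:
        return [0] * TAU + [1]          # the monomial x^tau for a missing genotype
    return [{0: -1, 1: 0, 2: 1}[z]]     # a constant polynomial

def decode(poly):
    """Evaluate the polynomial at x = 2 and reduce the result mod t."""
    return sum(c * 2 ** i for i, c in enumerate(poly)) % T

encodings = {z: encode_genotype(z) for z in (0, 1, 2, -9)}
decoded = {z: decode(p) for z, p in encodings.items()}
```

Here the missing genotype and the heterozygous genotype both decode to 0, yet their encodings (x^τ and the constant 0) differ, which is the distinction the text relies on.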

Further, as indicated previously, care is taken so that after the entire computation of the algorithm, there is no reduction mod $x^{n}+1$. This will guarantee that decoding gives the correct output. Therefore, if the algorithm to be computed can be represented as a polynomial of degree D, care is taken to ensure that $2^{D}\cdot\tau<n$.

Once the polynomials are computed and it has been determined there is no reduction mod $x^{n}+1$, the encryption can proceed as described previously.

1.1.1 Counting Genotype Frequencies

The first step in certain genomic computations is to compute genotype frequencies. For a specific genotype $i\in\{0,1,2\}$, the total number of samples with genotype i is computed (call it $N_{i}$) by summing over all samples z. In terms of computing genotype frequencies on the now encrypted genomic data, in one embodiment, the following procedure for counting genotypes is employed. The genotype counts represent the aforementioned frequencies.

Given a vector of ciphertexts $(\hat{z}_{1},\ldots,\hat{z}_{K})$ encrypting genotype samples in $\{-1,0,1,x^{\tau}\}$, the genotype frequencies are counted using:

$\hat{N}_{0}\leftarrow\sum\limits_{k=1}^{K}g_{0}(\hat{z}_{k}),\quad\hat{N}_{1}\leftarrow\sum\limits_{k=1}^{K}g_{1}(\hat{z}_{k}),\quad\hat{N}_{2}\leftarrow\sum\limits_{k=1}^{K}g_{2}(\hat{z}_{k}) \qquad (7)$

This produces ciphertexts $\hat{N}_{0}$, $\hat{N}_{1}$, $\hat{N}_{2}$ such that $\deg(\hat{N}_{0})=\deg(\hat{N}_{2})=2$ and $\deg(\hat{N}_{1})=3$, where $\hat{N}_{0}$, $\hat{N}_{1}$, $\hat{N}_{2}$ represent the encrypted genotype frequencies of the first homozygous genotype (0), the heterozygous genotype (1), and the second homozygous genotype (2), respectively.

In procedural terms, referring to FIG. 4, the foregoing genotype frequency computation involves first receiving a vector of ciphertexts each of which represents an encrypted genotype sample (process action 400). Then, an encryption of a count of each genotype present in the received vector of ciphertexts is computed as a measure of the frequency of that genotype (process action 402). It is noted that this last action is performed without decrypting the genomic data.
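The counting procedure of Eq. (7) uses only ring additions and multiplications, so its arithmetic can be mirrored on unencrypted ring elements. The sketch below is an illustration under stated assumptions: it works on plaintext polynomials in a toy ring Z₆₅₅₃₇[x]/(x¹⁶+1) with τ=3, uses hypothetical helper names, and decodes by evaluating at x=2 mod t, which is only faithful here because the counts stay below t=8. In the encrypted setting each `add` and `mul` would be replaced by SH.Add and SH.Mult on ciphertexts.

```python
N_DIM, Q, TAU = 16, 65537, 3      # toy parameters (assumed)
T = 2 ** TAU                      # t = 2^tau
INV2 = pow(2, -1, Q)              # (2^-1)_q; pow with exponent -1 needs Python 3.8+

def const(c):
    return [c % Q] + [0] * (N_DIM - 1)

def monomial(k):
    p = [0] * N_DIM
    p[k] = 1
    return p

def add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def sub(a, b):
    return [(x - y) % Q for x, y in zip(a, b)]

def mul(a, b):
    """Multiplication in Z_Q[x]/(x^N_DIM + 1)."""
    out = [0] * N_DIM
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k, sign = (i + j, 1) if i + j < N_DIM else (i + j - N_DIM, -1)
            out[k] = (out[k] + sign * ai * bj) % Q
    return out

# E_g from Eq. (3): genotypes 0, 1, 2 and the missing value -9.
ENCODE = {0: const(-1), 1: const(0), 2: const(1), -9: monomial(TAU)}

def g0(z):
    return mul(const(INV2), sub(mul(z, z), z))

def g1(z):
    return mul(sub(const(1), mul(z, z)),
               add(const(1), mul(z, monomial(N_DIM - TAU))))

def g2(z):
    return mul(const(INV2), add(mul(z, z), z))

def count_genotypes(samples):
    """Eq. (7): sum the indicator polynomials over all samples."""
    totals = [const(0), const(0), const(0)]
    for s in samples:
        z = ENCODE[s]
        for i, g in enumerate((g0, g1, g2)):
            totals[i] = add(totals[i], g(z))
    return totals

def decode(poly):
    """Centered lift of the coefficients, evaluate at x = 2, reduce mod t."""
    centered = [c - Q if c > Q // 2 else c for c in poly]
    return sum(c * 2 ** i for i, c in enumerate(centered)) % T

samples = [0, 1, 2, 2, 1, 1, -9, 0]
n0, n1, n2 = count_genotypes(samples)
counts = [decode(n0), decode(n1), decode(n2)]
```

The missing sample contributes exactly zero to the heterozygous count, while for the two homozygous counts it leaves terms of degree τ and 2τ that vanish on decoding.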

1.1.2 Genomic Computations Using Encrypted Genotype Frequencies

Once the encrypted genotype frequencies are computed, various genomic computations can be performed using the encrypted data. For example, the aforementioned Pearson Goodness-of-Fit (Chi-Squared) Test, Estimation Maximization Algorithm for Haplotyping and Linkage Disequilibrium computations can be performed, as will be described in more detail in the sections to follow.

It is noted that certain definitions apply to the following descriptions of the genomic computations. For example, the χ² distribution with k degrees of freedom is defined as the distribution obtained from adding the squares of k independent standard normal random variables. In addition, the p-value of a test statistic T in comparison to distribution D is defined as the probability according to D of having observed an outcome at least as extreme as the value observed. For test statistics T>0, the p-value of T in comparison to D is $\Pr[D\geq T]=1-\mathrm{CDF}_{D}(T)$. For significance level α and p-value p, it is concluded that the deviation of T from D is statistically significant if p<α. Common significance levels are 0.05 and 0.01.

1.1.2.1 Pearson Goodness-of-Fit (Chi-Squared) Test

The Pearson Goodness-Of-Fit (Chi-Squared) Test is a test for deviations from Hardy-Weinberg Equilibrium. Hardy-Weinberg Equilibrium (HWE) is a principle that states that allele frequencies will stay the same from generation to generation unless perturbed by evolutionary influences. Equivalently, the HWE states that allele frequencies are independent. More formally, consider two alleles A and a and let $p_{A}$, $p_{a}$ be their corresponding population frequencies, so that $p_{a}=1-p_{A}$. Similarly, let $p_{AA}$, $p_{Aa}$, $p_{aa}$ be the corresponding population frequencies for genotypes AA, Aa, aa. Then alleles A and a are independent (and HWE holds) if:

$p_{AA}=p_{A}^{2},\quad p_{Aa}=2p_{A}p_{a},\quad p_{aa}=p_{a}^{2}. \qquad (8)$

Let $N_{AA}$, $N_{Aa}$, $N_{aa}$ be the observed counts for genotypes AA, Aa, aa, respectively, and let $N\overset{def}{=}N_{AA}+N_{Aa}+N_{aa}$ be the total number of samples. Then the frequency of alleles A and a can be calculated by:

$\begin{matrix}{{p_{A} = \frac{{2N_{AA}} + N_{Aa}}{2N}},{p_{a} = {1 - {p_{A}.}}}} & (9)\end{matrix}$

Thus, HWE indicates that the following counts are expected:

$E_{AA}\overset{def}{=}Np_{A}^{2},\quad E_{Aa}\overset{def}{=}2Np_{A}p_{a},\quad E_{aa}\overset{def}{=}Np_{a}^{2}. \qquad (10)$

Deviation from the HWE is tested by comparing the following test-statistic to the χ²-statistic with 1 degree of freedom (3 genotypes minus 2 alleles):

$\begin{matrix}{X^{2}\overset{def}{=}{\sum\limits_{i \in {\{{{AA},{Aa},{aa}}\}}}\;\frac{\left( {N_{i} - E_{i}} \right)^{2}}{E_{i}}}} & (11)\end{matrix}$

To do this, the p-value p of X² is computed according to the χ²-distribution with 1 degree of freedom. It is then concluded that the data is in HWE only if p>α, for significance level α. When α=0.05, this reduces to checking if X²<3.84; and when α=0.01, this reduces to checking if X²<6.64.
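The decision rule above can be sketched with the standard library alone. For 1 degree of freedom the χ² variable is the square of a standard normal, so the p-value has the closed form erfc(√(X²/2)); the function names below are hypothetical and the sketch assumes X² has already been computed from decrypted values.

```python
from math import erfc, sqrt

def chi2_pvalue_1df(x2):
    """Pr[chi^2_1 >= x2], using the 1-degree-of-freedom closed form."""
    return erfc(sqrt(x2 / 2.0))

def in_hwe(x2, alpha=0.05):
    """Data is judged to be in HWE only if the p-value exceeds alpha."""
    return chi2_pvalue_1df(x2) > alpha

# The critical value for alpha = 0.05 sits near 3.84.
p_at_threshold = chi2_pvalue_1df(3.841)
```

This makes the equivalence between the p>α test and the X² threshold check explicit.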

The expected counts can be computed as:

$E_{0}=N\left(\frac{2N_{0}+N_{1}}{2N}\right)^{2},\quad E_{1}=2N\left(\frac{2N_{0}+N_{1}}{2N}\right)\left(\frac{2N_{2}+N_{1}}{2N}\right),\quad E_{2}=N\left(\frac{2N_{2}+N_{1}}{2N}\right)^{2}, \qquad (12)$

which can be simplified to:

$E_{0}=\frac{(2N_{0}+N_{1})^{2}}{4N},\quad E_{1}=\frac{(2N_{0}+N_{1})(2N_{2}+N_{1})}{2N},\quad E_{2}=\frac{(2N_{2}+N_{1})^{2}}{4N}. \qquad (13)$

The test statistic X² can then be computed as:

$\begin{aligned}X^{2}&=\frac{(N_{0}-E_{0})^{2}}{E_{0}}+\frac{(N_{1}-E_{1})^{2}}{E_{1}}+\frac{(N_{2}-E_{2})^{2}}{E_{2}}\\&=\frac{(4N_{0}N_{2}-N_{1}^{2})^{2}}{4N(2N_{0}+N_{1})^{2}}+\frac{(4N_{0}N_{2}-N_{1}^{2})^{2}}{2N(2N_{0}+N_{1})(2N_{2}+N_{1})}+\frac{(4N_{0}N_{2}-N_{1}^{2})^{2}}{4N(2N_{2}+N_{1})^{2}}\\&=\frac{(4N_{0}N_{2}-N_{1}^{2})^{2}}{2N}\left(\frac{1}{2(2N_{0}+N_{1})^{2}}+\frac{1}{(2N_{0}+N_{1})(2N_{2}+N_{1})}+\frac{1}{2(2N_{2}+N_{1})^{2}}\right)\end{aligned} \qquad (14)$

It therefore suffices to return encryptions of α, N, β₁, β₂, β₃, where:

$\begin{matrix}{{\alpha\overset{def}{=}\left( {{4N_{0}N_{2}} - N_{1}^{2}} \right)^{2}},{\beta_{1}\overset{def}{=}{2\left( {{2N_{0}} + N_{1}} \right)^{2}}},{\beta_{2}\overset{def}{=}{\left( {{2N_{0}} + N_{1}} \right)\left( {{2N_{2}} + N_{1}} \right)}},{\beta_{3}\overset{def}{=}{2\left( {{2N_{2}} + N_{1}} \right)^{2}}}} & (15)\end{matrix}$
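The quantities of Eq. (15) involve only additions and multiplications of the genotype counts, which is why they can be computed homomorphically, while the divisions are deferred until after decryption. The following sketch works on plaintext counts with hypothetical function names, and assumes every allele is observed at least once so that no denominator is zero. It checks that combining α, N, β₁, β₂, β₃ as in the last line of Eq. (14) agrees exactly with the direct definition in Eq. (11).

```python
from fractions import Fraction

def hwe_components(n0, n1, n2):
    """alpha, N, beta_1, beta_2, beta_3 from Eq. (15): no divisions needed."""
    alpha = (4 * n0 * n2 - n1 ** 2) ** 2
    n = n0 + n1 + n2
    b1 = 2 * (2 * n0 + n1) ** 2
    b2 = (2 * n0 + n1) * (2 * n2 + n1)
    b3 = 2 * (2 * n2 + n1) ** 2
    return alpha, n, b1, b2, b3

def chi_squared(alpha, n, b1, b2, b3):
    """X^2 = alpha / (2N) * (1/beta_1 + 1/beta_2 + 1/beta_3)."""
    return Fraction(alpha, 2 * n) * (Fraction(1, b1) + Fraction(1, b2) + Fraction(1, b3))

def chi_squared_direct(n0, n1, n2):
    """Eq. (11), with the expected counts taken from Eq. (13)."""
    n = n0 + n1 + n2
    expected = [
        Fraction((2 * n0 + n1) ** 2, 4 * n),
        Fraction((2 * n0 + n1) * (2 * n2 + n1), 2 * n),
        Fraction((2 * n2 + n1) ** 2, 4 * n),
    ]
    return sum((Fraction(o) - e) ** 2 / e for o, e in zip((n0, n1, n2), expected))

x2 = chi_squared(*hwe_components(30, 50, 20))
```

Exact rational arithmetic is used here so that the two routes can be compared with `==` rather than a floating-point tolerance.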

In view of the foregoing, the Pearson Goodness-Of-Fit (Chi-Squared) Test can be performed using encrypted genotype frequencies as follows. The encrypted genotype frequency counts $\hat{N}_{0}$, $\hat{N}_{1}$, $\hat{N}_{2}$ with $\deg(\hat{N}_{0})=\deg(\hat{N}_{2})=2$ and $\deg(\hat{N}_{1})=3$, generated from the vector of ciphertexts $(\hat{z}_{1},\ldots,\hat{z}_{K})$ encrypting genotype samples in $\{-1,0,1,x^{\tau}\}$, are input. Ciphertexts $\hat{\alpha}$, $\hat{N}$, $\hat{\beta}_{1}$, $\hat{\beta}_{2}$, $\hat{\beta}_{3}$ are then computed as follows:

$\hat{\alpha}\leftarrow(4\hat{N}_{0}\hat{N}_{2}-\hat{N}_{1}^{2})^{2}\quad\text{where}\ \deg(\hat{\alpha})=12; \qquad (16)$

$\hat{N}\leftarrow\hat{N}_{0}+\hat{N}_{1}+\hat{N}_{2}\quad\text{where}\ \deg(\hat{N})=3;\ \text{and} \qquad (17)$

$\hat{\beta}_{1}\leftarrow 2(2\hat{N}_{0}+\hat{N}_{1})^{2},\quad\hat{\beta}_{2}\leftarrow(2\hat{N}_{0}+\hat{N}_{1})(2\hat{N}_{2}+\hat{N}_{1}),\quad\hat{\beta}_{3}\leftarrow 2(2\hat{N}_{2}+\hat{N}_{1})^{2}\quad\text{where}\ \deg(\hat{\beta}_{i})=6. \qquad (18)$

The ciphertexts $\hat{\alpha}$, $\hat{N}$, $\hat{\beta}_{1}$, $\hat{\beta}_{2}$, $\hat{\beta}_{3}$ are then output (such as to the user as described previously). Once decrypted and decoded, X² can be computed such that:

$X^{2}=\frac{\alpha}{2N}\left(\frac{1}{\beta_{1}}+\frac{1}{\beta_{2}}+\frac{1}{\beta_{3}}\right). \qquad (19)$

1.1.2.2 Estimation Maximization for Haplotyping

Haplotypes cannot be exactly determined from genotypes. For example, consider two bi-allelic loci with alleles A,a and B,b. The genotype AaBb can be one of two possible haplotypes: (AB)(ab) or (Ab)(aB). Estimation Maximization (EM) is used to estimate haplotype frequencies from genotype counts. The EM procedure is an iterative method to estimate haplotype frequencies, starting with arbitrary initial values $p_{AB}^{(0)}$, $p_{Ab}^{(0)}$, $p_{aB}^{(0)}$, $p_{ab}^{(0)}$. These values are first used in an estimation stage to calculate the expected genotype frequencies (assuming the initial values are the true haplotype frequencies), and these, in turn, are used to estimate the haplotype frequencies for the next iteration in the maximization stage. The procedure stops when the haplotype frequencies have stabilized.

More particularly, in the ith estimation iteration:

$\begin{matrix}{E_{AB/ab}^{(i)}\overset{def}{=}\mathbb{E}\left\lbrack N_{AB/ab}\mid N_{AaBb},p_{AB}^{(i-1)},p_{Ab}^{(i-1)},p_{aB}^{(i-1)},p_{ab}^{(i-1)}\right\rbrack=N_{AaBb}\cdot\frac{p_{AB}^{(i-1)}p_{ab}^{(i-1)}}{p_{AB}^{(i-1)}p_{ab}^{(i-1)}+p_{Ab}^{(i-1)}p_{aB}^{(i-1)}}} & (20) \\ {E_{Ab/aB}^{(i)}\overset{def}{=}\mathbb{E}\left\lbrack N_{Ab/aB}\mid N_{AaBb},p_{AB}^{(i-1)},p_{Ab}^{(i-1)},p_{aB}^{(i-1)},p_{ab}^{(i-1)}\right\rbrack=N_{AaBb}\cdot\frac{p_{Ab}^{(i-1)}p_{aB}^{(i-1)}}{p_{AB}^{(i-1)}p_{ab}^{(i-1)}+p_{Ab}^{(i-1)}p_{aB}^{(i-1)}};} & (21)\end{matrix}$

and in the ith maximization iteration,

$\begin{matrix}{N_{AB}^{(i)}=2N_{AABB}+N_{AABb}+N_{AaBB}+E_{AB/ab}^{(i)}} & \\ {N_{ab}^{(i)}=2N_{aabb}+N_{aaBb}+N_{Aabb}+E_{AB/ab}^{(i)}} & \\ {N_{Ab}^{(i)}=2N_{AAbb}+N_{AABb}+N_{Aabb}+E_{Ab/aB}^{(i)}} & \\ {N_{aB}^{(i)}=2N_{aaBB}+N_{AaBB}+N_{aaBb}+E_{Ab/aB}^{(i)}} & (22)\end{matrix}$

Before running the EM procedure, the 9 genotype counts $N_{ij}$ for $i,j\in\{0,1,2\}$ are computed. This can be done with the function $f_{ij}(z,z')=g_{i}(z)\cdot g_{j}(z')$. In terms of the encrypted genotype frequency counts, the foregoing amounts to counting genotype frequencies for two loci.

More particularly, given two vectors of ciphertexts $(\hat{z}_{1},\ldots,\hat{z}_{K})$ and $(\hat{z}'_{1},\ldots,\hat{z}'_{K})$ encrypting genotype samples in $\{-1,0,1,x^{\tau}\}$, for loci 1 and 2 respectively, the genotype frequencies for two loci are counted using:

$\begin{matrix}{\hat{N}_{ij}\leftarrow\sum\limits_{k=1}^{K}g_{i}(\hat{z}_{k})\cdot g_{j}(\hat{z}'_{k})\quad\text{for }i,j\in\{0,1,2\}.} & (23)\end{matrix}$

This produces ciphertexts $\hat{N}_{00},\hat{N}_{10},\hat{N}_{20},\hat{N}_{01},\hat{N}_{11},\hat{N}_{21},\hat{N}_{02},\hat{N}_{12},\hat{N}_{22}$ such that $\deg(\hat{N}_{ij})\le 6$.

In procedural terms, referring to FIG. 5, the foregoing frequency computation for two loci involves first receiving two vectors of ciphertexts, each vector representing encrypted genotypes from a different locus and each ciphertext of each vector representing an encrypted genotype sample from the locus associated with that vector (process action 500). Then, an encryption of a count of the pairings of each genotype from one of the vectors with each genotype from the other vector is computed as a measure of the frequency of that genotype pair (process action 502). It is noted that this last action is performed without decrypting the genomic data.

Referring again to performing the EM procedure, the estimated haplotype frequencies $p_{*}^{(i)}$ are real numbers. These can be substituted by the estimated haplotype counts $N_{*}^{(i)}\overset{def}{=}2N\cdot p_{*}^{(i)}$, since this does not change the fraction in $E_{AB/ab}^{(i)}$ and $E_{Ab/aB}^{(i)}$ (essentially, this change multiplies both the numerator and the denominator by $4N^{2}$). This modifies the ith estimation iteration as follows:

$\begin{matrix}{E_{AB/ab}^{(i)}=N_{11}\cdot\frac{N_{AB}^{(i-1)}N_{ab}^{(i-1)}}{N_{AB}^{(i-1)}N_{ab}^{(i-1)}+N_{Ab}^{(i-1)}N_{aB}^{(i-1)}}\overset{def}{=}\frac{\alpha^{(i)}}{\beta^{(i)}};} & (24) \\ {E_{Ab/aB}^{(i)}=N_{11}\cdot\frac{N_{Ab}^{(i-1)}N_{aB}^{(i-1)}}{N_{AB}^{(i-1)}N_{ab}^{(i-1)}+N_{Ab}^{(i-1)}N_{aB}^{(i-1)}}\overset{def}{=}\frac{\gamma^{(i)}}{\beta^{(i)}}.} & (25)\end{matrix}$

It is also possible to simplify each iteration so that at any given point, only one numerator and one denominator are needed by defining:

$\begin{matrix}{\zeta_{AB}\overset{def}{=}2N_{22}+N_{21}+N_{12};} & \\ {\zeta_{ab}\overset{def}{=}2N_{00}+N_{01}+N_{10};} & \\ {\zeta_{Ab}\overset{def}{=}2N_{20}+N_{21}+N_{10};} & \\ {\zeta_{aB}\overset{def}{=}2N_{02}+N_{12}+N_{01}.} & (26)\end{matrix}$

Then:

$\begin{matrix}{{N_{AB}^{(i)} = {{\zeta_{AB} + E_{{AB}/{ab}}^{(i)}} = {{\zeta_{AB} + \frac{\alpha^{(i)}}{\beta^{(i)}}} = \frac{{\zeta_{AB} \cdot \beta^{(i)}} + \alpha^{(i)}}{\beta^{(i)}}}}};} & (27) \\{{N_{ab}^{(i)} = {{\zeta_{ab} + E_{{AB}/{ab}}^{(i)}} = {{\zeta_{ab} + \frac{\alpha^{(i)}}{\beta^{(i)}}} = \frac{{\zeta_{ab} \cdot \beta^{(i)}} + \alpha^{(i)}}{\beta^{(i)}}}}};} & (28) \\{{N_{Ab}^{(i)} = {{\zeta_{Ab} + E_{{Ab}/{aB}}^{(i)}} = {{\zeta_{Ab} + \frac{\gamma^{(i)}}{\beta^{(i)}}} = \frac{{\zeta_{Ab} \cdot \beta^{(i)}} + \gamma^{(i)}}{\beta^{(i)}}}}};} & (29) \\{N_{aB}^{(i)} = {{\zeta_{aB} + E_{{Ab}/{aB}}^{(i)}} = {{\zeta_{aB} + \frac{\gamma^{(i)}}{\beta^{(i)}}} = {\frac{{\zeta_{aB} \cdot \beta^{(i)}} + \gamma^{(i)}}{\beta^{(i)}}.}}}} & (30)\end{matrix}$

At the next, (i+1)th, estimation iteration, the following is computed:

$\begin{matrix}\begin{matrix}{E_{AB/ab}^{(i+1)}=N_{11}\cdot\frac{\left(\frac{\zeta_{AB}\cdot\beta^{(i)}+\alpha^{(i)}}{\beta^{(i)}}\right)\left(\frac{\zeta_{ab}\cdot\beta^{(i)}+\alpha^{(i)}}{\beta^{(i)}}\right)}{\left(\frac{\zeta_{AB}\cdot\beta^{(i)}+\alpha^{(i)}}{\beta^{(i)}}\right)\left(\frac{\zeta_{ab}\cdot\beta^{(i)}+\alpha^{(i)}}{\beta^{(i)}}\right)+\left(\frac{\zeta_{Ab}\cdot\beta^{(i)}+\gamma^{(i)}}{\beta^{(i)}}\right)\left(\frac{\zeta_{aB}\cdot\beta^{(i)}+\gamma^{(i)}}{\beta^{(i)}}\right)}} \\ {=N_{11}\cdot\frac{\left(\zeta_{AB}\cdot\beta^{(i)}+\alpha^{(i)}\right)\left(\zeta_{ab}\cdot\beta^{(i)}+\alpha^{(i)}\right)}{\left(\zeta_{AB}\cdot\beta^{(i)}+\alpha^{(i)}\right)\left(\zeta_{ab}\cdot\beta^{(i)}+\alpha^{(i)}\right)+\left(\zeta_{Ab}\cdot\beta^{(i)}+\gamma^{(i)}\right)\left(\zeta_{aB}\cdot\beta^{(i)}+\gamma^{(i)}\right)}\overset{def}{=}\frac{\alpha^{(i+1)}}{\beta^{(i+1)}}}\end{matrix} & (31)\end{matrix}$

Similarly,

$\begin{matrix}{E_{{Ab}/{aB}}^{({i + 1})} = {{N_{11} \cdot \frac{\left( {\left( {{\zeta_{Ab} \cdot \beta^{(i)}} + \gamma^{(i)}} \right)\left( {{\zeta_{aB} \cdot \beta^{(i)}} + \gamma^{(i)}} \right)} \right)}{\begin{matrix}{{\left( {{\zeta_{AB} \cdot \beta^{(i)}} + \alpha^{(i)}} \right)\left( {{\zeta_{ab} \cdot \beta^{(i)}} + \alpha^{(i)}} \right)} +} \\{\left( {{\zeta_{Ab} \cdot \beta^{(i)}} + \gamma^{(i)}} \right)\left( {{\zeta_{aB} \cdot \beta^{(i)}} + \gamma^{(i)}} \right)}\end{matrix}}}\overset{def}{=}\frac{\gamma^{({i + 1})}}{\beta^{({i + 1})}}}} & (32)\end{matrix}$

In other words, since the denominator $\beta^{(i)}$ of the $N_{*}^{(i)}$ values always cancels out, only the numerators need be tracked. The numerators depend on $\beta^{(i)}$, so it is still computed as part of the numerator computation, but there is no need to keep track of it after this computation (with the exception that at the last iteration, it is necessary to divide by $\beta^{(i)}$ to maintain correctness).

In view of the foregoing, the ith estimation iteration is,

$\begin{matrix}{\alpha^{(i)}=N_{11}\cdot N_{AB}^{(i-1)}N_{ab}^{(i-1)},\quad\gamma^{(i)}=N_{11}\cdot N_{Ab}^{(i-1)}N_{aB}^{(i-1)},\quad\beta^{(i)}=N_{AB}^{(i-1)}N_{ab}^{(i-1)}+N_{Ab}^{(i-1)}N_{aB}^{(i-1)},} & (33)\end{matrix}$

and the ith maximization iteration is,

$\begin{matrix}{N_{AB}^{(i)}=\zeta_{AB}\cdot\beta^{(i)}+\alpha^{(i)},\quad N_{ab}^{(i)}=\zeta_{ab}\cdot\beta^{(i)}+\alpha^{(i)},\quad N_{Ab}^{(i)}=\zeta_{Ab}\cdot\beta^{(i)}+\gamma^{(i)},\quad N_{aB}^{(i)}=\zeta_{aB}\cdot\beta^{(i)}+\gamma^{(i)}.} & (34)\end{matrix}$

In each iteration, the degree goes from D to 2D+6. Starting with unencrypted estimations $N_{*}^{(0)}$ (with degree 0), after m iterations the degree is $6\cdot(2^{m}-1)$.
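The division-free iteration of equations (33) and (34) can be sketched on plaintext integers as follows. This is an illustrative sketch rather than the encrypted procedure: the function names are hypothetical, the starting estimates are arbitrarily set to 1 (degree 0), and a conventional EM loop with exact rational division is included only to check that the ratio of the final numerator to the final $\beta$ agrees with it. The degree is tracked with the recursion D to 2D+6 stated above.

```python
from fractions import Fraction

HAPLOTYPES = ("AB", "ab", "Ab", "aB")


def zeta_terms(counts):
    """Fixed parts of the haplotype counts, equation (26); counts[i][j] = N_ij."""
    n = counts
    return {
        "AB": 2 * n[2][2] + n[2][1] + n[1][2],
        "ab": 2 * n[0][0] + n[0][1] + n[1][0],
        "Ab": 2 * n[2][0] + n[2][1] + n[1][0],
        "aB": 2 * n[0][2] + n[1][2] + n[0][1],
    }


def em_fraction_free(counts, m):
    """Equations (33)-(34): additions and multiplications only."""
    zeta, n11 = zeta_terms(counts), counts[1][1]
    eta = {h: 1 for h in HAPLOTYPES}  # assumed arbitrary starting estimates
    beta, degree = 1, 0
    for _ in range(m):
        alpha = n11 * eta["AB"] * eta["ab"]
        gamma = n11 * eta["Ab"] * eta["aB"]
        beta = eta["AB"] * eta["ab"] + eta["Ab"] * eta["aB"]
        eta = {
            "AB": zeta["AB"] * beta + alpha,
            "ab": zeta["ab"] * beta + alpha,
            "Ab": zeta["Ab"] * beta + gamma,
            "aB": zeta["aB"] * beta + gamma,
        }
        degree = 2 * degree + 6  # degree growth per iteration
    return eta, beta, degree


def em_with_division(counts, m):
    """Conventional EM on counts, equations (24)-(25), for comparison."""
    zeta, n11 = zeta_terms(counts), counts[1][1]
    est = {h: Fraction(1) for h in HAPLOTYPES}
    for _ in range(m):
        denom = est["AB"] * est["ab"] + est["Ab"] * est["aB"]
        e_cis = n11 * est["AB"] * est["ab"] / denom
        e_trans = n11 * est["Ab"] * est["aB"] / denom
        est = {
            "AB": zeta["AB"] + e_cis,
            "ab": zeta["ab"] + e_cis,
            "Ab": zeta["Ab"] + e_trans,
            "aB": zeta["aB"] + e_trans,
        }
    return est


if __name__ == "__main__":
    table = [[10, 5, 2], [6, 8, 4], [1, 3, 9]]
    eta, beta, degree = em_fraction_free(table, 3)
    print(float(Fraction(eta["AB"], beta)), degree)
```

Deferring the single division by $\beta$ to the end is what keeps every intermediate step expressible with ciphertext additions and multiplications.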

In view of the foregoing, the EM Algorithm for Haplotyping can be performed using encrypted genotype frequencies as follows. The two vectors of ciphertexts $(\hat{z}_{1},\ldots,\hat{z}_{K})$, $(\hat{z}'_{1},\ldots,\hat{z}'_{K})$ encrypting genotype samples in $\{-1,0,1,x^{\tau}\}$, for loci 1 and 2 respectively, and the number of iterations m to be performed, are input, along with the ciphertexts $\hat{N}_{00},\hat{N}_{10},\hat{N}_{20},\hat{N}_{01},\hat{N}_{11},\hat{N}_{21},\hat{N}_{02},\hat{N}_{12},\hat{N}_{22}$ computed as described previously. Ciphertexts $\hat{\eta},\hat{\beta}$ are then computed as follows:

$\begin{matrix}{\hat{\zeta}_{1}\leftarrow 2\hat{N}_{00}+\hat{N}_{01}+\hat{N}_{10}} & \\ {\hat{\zeta}_{2}\leftarrow 2\hat{N}_{22}+\hat{N}_{21}+\hat{N}_{12}\qquad\text{// }\deg(\hat{\zeta}_{i})=6;} & \\ {\hat{\zeta}_{3}\leftarrow 2\hat{N}_{02}+\hat{N}_{01}+\hat{N}_{12}} & \\ {\hat{\zeta}_{4}\leftarrow 2\hat{N}_{20}+\hat{N}_{10}+\hat{N}_{21}} & (35) \\ {\hat{N}\leftarrow\hat{N}_{00}+\hat{N}_{10}+\hat{N}_{20}+\hat{N}_{01}+\hat{N}_{11}+\hat{N}_{21}+\hat{N}_{02}+\hat{N}_{12}+\hat{N}_{22}\qquad\text{// }\deg(\hat{N})=6;} & (36) \\ {\hat{\eta}_{1}^{(0)}\leftarrow\hat{N},\ \hat{\eta}_{2}^{(0)}\leftarrow\hat{N},\ \hat{\eta}_{3}^{(0)}\leftarrow\hat{N},\ \hat{\eta}_{4}^{(0)}\leftarrow\hat{N}\qquad\text{// }\deg(\hat{\eta}_{i}^{(0)})=6;} & (37) \\ {\hat{\beta}^{(0)}\leftarrow 2\qquad\text{// }\deg(\hat{\beta}^{(0)})=0;} & (38)\end{matrix}$

For each iteration $i\leftarrow 1$ to $m$:

$\begin{matrix}{\hat{\alpha}^{(i)}\leftarrow\hat{N}_{11}\cdot\hat{\eta}_{1}^{(i-1)}\hat{\eta}_{2}^{(i-1)}} & (39) \\ {\hat{\gamma}^{(i)}\leftarrow\hat{N}_{11}\cdot\hat{\eta}_{3}^{(i-1)}\hat{\eta}_{4}^{(i-1)}\qquad\text{// }\deg(\hat{\alpha}^{(i)},\hat{\gamma}^{(i)})=6\cdot(2^{i}-1)} & (40) \\ {\hat{\beta}^{(i)}\leftarrow\hat{\eta}_{1}^{(i-1)}\hat{\eta}_{2}^{(i-1)}+\hat{\eta}_{3}^{(i-1)}\hat{\eta}_{4}^{(i-1)}\qquad\text{// }\deg(\hat{\beta}^{(i)})=6\cdot(2^{i}-2)} & (41) \\ {\hat{\eta}_{1}^{(i)}\leftarrow\hat{\zeta}_{1}\cdot\hat{\beta}^{(i)}+\hat{\alpha}^{(i)}} & \\ {\hat{\eta}_{2}^{(i)}\leftarrow\hat{\zeta}_{2}\cdot\hat{\beta}^{(i)}+\hat{\alpha}^{(i)}\qquad\text{// }\deg(\hat{\eta}_{j}^{(i)})=6\cdot(2^{i}-1)} & \\ {\hat{\eta}_{3}^{(i)}\leftarrow\hat{\zeta}_{3}\cdot\hat{\beta}^{(i)}+\hat{\gamma}^{(i)}} & \\ {\hat{\eta}_{4}^{(i)}\leftarrow\hat{\zeta}_{4}\cdot\hat{\beta}^{(i)}+\hat{\gamma}^{(i)}} & (42)\end{matrix}$

$\begin{matrix}{\hat{\eta}\leftarrow\hat{\eta}_{1}^{(m)}\qquad\text{// }\deg(\hat{\eta})=6\cdot(2^{m}-1)} & (43) \\ {\hat{\beta}\leftarrow\hat{\beta}^{(m)}\qquad\text{// }\deg(\hat{\beta})=6\cdot(2^{m}-2)} & (44)\end{matrix}$

The ciphertexts $\hat{\eta},\hat{\beta}$ are then output (such as to the user as described previously). Once decrypted and decoded, the estimated count for haplotype AB (i.e., $N_{AB}$) can be computed from $\eta$ and $\beta$ such that $N_{AB}=\eta/\beta$. It is noted that $N_{AB}$ is used in calculating the scalar D for linkage disequilibrium as will be described next.

1.1.2.3 Linkage Disequilibrium

Linkage disequilibrium (LD) is an association in the alleles present at each of two sites in a genome (unlike HWE where it is assumed the alleles at each site are independent). Suppose A and a are possible alleles at site 1 and B and b are possible alleles at site 2, and let $p_{A}, p_{a}, p_{B}, p_{b}$ be their corresponding population frequencies. Similarly, let $p_{AB}, p_{Ab}, p_{aB}, p_{ab}$ be the frequencies of the haplotypes AB, Ab, aB, ab, respectively. Under linkage equilibrium, it is expected these frequencies are independent, i.e., it is expected that:

$\begin{matrix}{p_{AB}=p_{A}p_{B},\quad p_{Ab}=p_{A}p_{b},\quad p_{aB}=p_{a}p_{B},\quad p_{ab}=p_{a}p_{b}} & (45)\end{matrix}$

When the alleles are in linkage disequilibrium (LD), the frequencies will deviate from the values above by a scalar D, so that:

$\begin{matrix}{p_{AB}=p_{A}p_{B}+D,\quad p_{Ab}=p_{A}p_{b}-D,\quad p_{aB}=p_{a}p_{B}-D,\quad p_{ab}=p_{a}p_{b}+D} & (46)\end{matrix}$

This scalar D is calculated as $D=p_{AB}p_{ab}-p_{Ab}p_{aB}=p_{AB}-p_{A}p_{B}$. However, the range of D depends on the frequencies, which makes it difficult to use it as a measure of LD. One of two scaled-down variants is used instead, the D′-measure or the r²-measure.

1.1.2.3.1 D′-Measure

It is easy to show that $\max\{-p_{A}p_{B},-p_{a}p_{b}\}\le D\le\min\{p_{A}p_{b},p_{a}p_{B}\}$, so that the maximum value $D_{\max}$ that $|D|$ can take is:

$\begin{matrix}{D_{\max} = \left\{ \begin{matrix}{\min\left\{ {{p_{A}p_{b}},{p_{a}p_{B}}} \right\}} & {D > 0} \\{\min\left\{ {{p_{A}p_{B}},{p_{a}p_{b}}} \right\}} & {D < 0}\end{matrix} \right.} & (47)\end{matrix}$

The D′-measure is then defined as

$\begin{matrix}{D^{\prime}\overset{def}{=}\frac{D}{D_{\max}}} & (48)\end{matrix}$

The magnitude of D′ ranges over [0,1], with a value of 0 meaning complete equilibrium and a value of 1 meaning complete disequilibrium.
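A small plaintext sketch of the D′ computation follows. It is illustrative only: the function name is hypothetical, the allele frequencies are assumed to be obtained as marginal sums of the four haplotype frequencies, and the case D = 0 (which equation (47) leaves unspecified) is assumed to give D′ = 0.

```python
from fractions import Fraction


def d_and_d_prime(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, D') from the four haplotype frequencies."""
    # Allele frequencies as marginals of the haplotype frequencies (assumption).
    p_A, p_a = p_AB + p_Ab, p_aB + p_ab
    p_B, p_b = p_AB + p_aB, p_Ab + p_ab
    d = p_AB * p_ab - p_Ab * p_aB
    if d == 0:
        return d, Fraction(0)  # assumed convention for the undefined case
    if d > 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    return d, d / d_max


if __name__ == "__main__":
    freqs = [Fraction(5, 10), Fraction(1, 10), Fraction(1, 10), Fraction(3, 10)]
    print(d_and_d_prime(*freqs))
```

With the haplotype frequencies 0.5, 0.1, 0.1, 0.3 used above, the allele frequencies are $p_{A}=p_{B}=0.6$, so $D=0.5-0.36=0.14$ and $D_{\max}=0.24$.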

1.1.2.3.2 r²-Measure

The r² measure is given by:

$\begin{matrix}{r^{2}\overset{def}{=}\frac{X^{2}}{N}} & (49)\end{matrix}$

where $X^{2}$ is the Pearson Goodness-of-Fit (Chi-Squared) Test statistic:

$\begin{matrix}{X^{2}\overset{def}{=}\sum\limits_{i\in\{A,a\},\ j\in\{B,b\}}\frac{\left(O_{ij}-E_{ij}\right)^{2}}{E_{ij}}} & (50)\end{matrix}$

where $O_{ij}\overset{def}{=}N_{ij}$ is the observed count and $E_{ij}\overset{def}{=}N\,p_{i}p_{j}$ is the expected count. Using the fact that $|O_{ij}-E_{ij}|=N|D|$, it can be shown that:

$\begin{matrix}{r^{2} = \frac{D^{2}}{p_{A}p_{B}p_{a}p_{b}}} & (51)\end{matrix}$

The range of r² is [0,1] with a value of 0 meaning perfect equilibrium and a value of 1 meaning perfect disequilibrium.

1.1.2.3.3 Linkage Disequilibrium Using Encrypted Genotype Frequency Counts

In view of the foregoing, LD measurement can be performed using encrypted genotype frequencies as follows. The two vectors of ciphertexts $(\hat{z}_{1},\ldots,\hat{z}_{K})$, $(\hat{z}'_{1},\ldots,\hat{z}'_{K})$ encrypting genotype samples in $\{-1,0,1,x^{\tau}\}$, for loci 1 and 2 respectively, and the number of iterations m to be performed, are input, along with the ciphertexts $\hat{N}_{00},\hat{N}_{10},\hat{N}_{20},\hat{N}_{01},\hat{N}_{11},\hat{N}_{21},\hat{N}_{02},\hat{N}_{12},\hat{N}_{22}$ computed as described previously. Ciphertexts $\hat{\delta}_{1},\hat{\delta}_{2},\hat{N}_{A},\hat{N}_{a},\hat{N}_{B},\hat{N}_{b}$ are then computed as follows:

$\begin{matrix}{\hat{N}\leftarrow\hat{N}_{00}+\hat{N}_{10}+\hat{N}_{20}+\hat{N}_{01}+\hat{N}_{11}+\hat{N}_{21}+\hat{N}_{02}+\hat{N}_{12}+\hat{N}_{22}\qquad\text{// }\deg(\hat{N})=6} & (52) \\ {\hat{R}_{0}\leftarrow\hat{N}_{00}+\hat{N}_{01}+\hat{N}_{02},\quad\hat{R}_{1}\leftarrow\hat{N}_{10}+\hat{N}_{11}+\hat{N}_{12},\quad\hat{R}_{2}\leftarrow\hat{N}_{20}+\hat{N}_{21}+\hat{N}_{22}} & (53) \\ {\hat{C}_{0}\leftarrow\hat{N}_{00}+\hat{N}_{10}+\hat{N}_{20},\quad\hat{C}_{1}\leftarrow\hat{N}_{01}+\hat{N}_{11}+\hat{N}_{21},\quad\hat{C}_{2}\leftarrow\hat{N}_{02}+\hat{N}_{12}+\hat{N}_{22}} & (54) \\ {\hat{N}_{A}\leftarrow 2\hat{R}_{0}+\hat{R}_{1},\quad\hat{N}_{a}\leftarrow 2\hat{R}_{2}+\hat{R}_{1}\qquad\text{// }\deg(\hat{N}_{A},\hat{N}_{a},\hat{N}_{B},\hat{N}_{b})=6} & (55) \\ {\hat{N}_{B}\leftarrow 2\hat{C}_{0}+\hat{C}_{1},\quad\hat{N}_{b}\leftarrow 2\hat{C}_{2}+\hat{C}_{1}} & (56) \\ {\hat{\eta},\hat{\beta}\leftarrow EM(\hat{z}_{1},\ldots,\hat{z}_{K},\hat{z}'_{1},\ldots,\hat{z}'_{K},m)\qquad\text{// }\deg(\hat{\eta})=6\cdot(2^{m}-1),\ \deg(\hat{\beta})=6\cdot(2^{m}-2)} & (57) \\ {\hat{\delta}_{1}\leftarrow 2\hat{N}\hat{\eta}-\hat{N}_{A}\hat{N}_{B}\hat{\beta},\quad\hat{\delta}_{2}\leftarrow 2\hat{N}\hat{\beta}\qquad\text{// }\deg(\hat{\delta}_{1},\hat{\delta}_{2})\le 6\cdot 2^{m}} & (58)\end{matrix}$

The ciphertexts $\hat{\delta}_{1},\hat{\delta}_{2},\hat{N}_{A},\hat{N}_{a},\hat{N}_{B},\hat{N}_{b}$ are then output (such as to the user as described previously). Once decrypted and decoded, the LD measurement can be computed from $\delta_{1},\delta_{2},N_{A},N_{a},N_{B},N_{b}$ such that $D=\delta_{1}/\delta_{2}$.

1.2 Encoding and Encrypting Phenotypes

In the same way an encoding $E_{g}$ for genotypes is chosen, an encoding $E_{p}$ is also chosen for phenotypes. There are two possible phenotypes, 0 and 1, with 0 being the unaffected phenotype and 1 being the affected phenotype. There also may be a missing phenotype, which is represented as −9.

In the case of phenotypes, the encoding can be arbitrarily chosen as $E_{p}:\{0,1,-9\}\rightarrow R_{q}$. To ensure that the coefficients of the encoding (in $R_{q}$) are small, the following further encoding is employed:

$\begin{matrix}{{E_{p}(z)}\overset{def}{=}\left\{ \begin{matrix}{- 1} & {z = 0} \\1 & {z = 1} \\0 & {z = {- 9}}\end{matrix} \right.} & (59)\end{matrix}$

For the purposes of the encrypted computations, a 2×3 contingency table of genotype/phenotype counts is needed. To this end, the input is a vector $\overset{\rightarrow}{z}\overset{def}{=}(z_{1},\ldots,z_{K})^{T}\in\{-1,0,1,x^{\tau}\}^{K}$ of genotype samples, and a vector $\overset{\rightarrow}{y}\overset{def}{=}(y_{1},\ldots,y_{K})^{T}\in\{0,1\}^{K}$ of phenotype samples. For each genotype/phenotype pair (i,j), it is desired to define an indicator polynomial $h_{i,j}$ such that:

$\begin{matrix}{{h_{i,j}\left( {z,y} \right)} = \left\{ \begin{matrix}1 & {\left( {z,y} \right) = \left( {{E_{g}(i)},{E_{p}(j)}} \right)} \\0 & {\left( {z,y} \right) \neq \left( {{E_{g}(i)},{E_{p}(j)}} \right)}\end{matrix} \right.} & (60)\end{matrix}$

Here again, let $(2^{-1})_{q}$ denote the inverse of 2 modulo q. Lagrange interpolation can be employed to find the polynomial computing each $h_{i,j}$. In one implementation, the following polynomials are computed (over $R_{q}$):

$\begin{matrix}{h_{i,1}(z,y)\overset{def}{=}g_{i}(z)\cdot\left(2^{-1}\right)_{q}\cdot\left(y^{2}+y\right)=g_{i}(z)\cdot g_{2}(y),} & \\ {h_{i,0}(z,y)\overset{def}{=}g_{i}(z)\cdot\left(2^{-1}\right)_{q}\cdot\left(y^{2}-y\right)=g_{i}(z)\cdot g_{0}(y)} & (61)\end{matrix}$

Once the polynomials are computed, the encryption can proceed as described previously.

1.2.1 Counting Genotype/Phenotype Frequencies

The genomic computations also employ genotype/phenotype frequencies.

Given a vector of ciphertexts $(\hat{z}_{1},\ldots,\hat{z}_{K})$ encrypting genotype samples in $\{-1,0,1,x^{\tau}\}$, and a vector of ciphertexts $(\hat{y}_{1},\ldots,\hat{y}_{K})$ encrypting phenotypes in $\{-1,1,0\}$, in one embodiment the genotype/phenotype frequencies are counted using:

$\begin{matrix}{\hat{N}_{00}\leftarrow\sum\limits_{k=1}^{K}g_{0}(\hat{z}_{k})\cdot g_{0}(\hat{y}_{k}),\quad\hat{N}_{10}\leftarrow\sum\limits_{k=1}^{K}g_{1}(\hat{z}_{k})\cdot g_{0}(\hat{y}_{k}),\quad\hat{N}_{20}\leftarrow\sum\limits_{k=1}^{K}g_{2}(\hat{z}_{k})\cdot g_{0}(\hat{y}_{k})} & \\ {\hat{N}_{01}\leftarrow\sum\limits_{k=1}^{K}g_{0}(\hat{z}_{k})\cdot g_{2}(\hat{y}_{k}),\quad\hat{N}_{11}\leftarrow\sum\limits_{k=1}^{K}g_{1}(\hat{z}_{k})\cdot g_{2}(\hat{y}_{k}),\quad\hat{N}_{21}\leftarrow\sum\limits_{k=1}^{K}g_{2}(\hat{z}_{k})\cdot g_{2}(\hat{y}_{k})} & (62)\end{matrix}$

This produces ciphertexts $\hat{N}_{00},\hat{N}_{10},\hat{N}_{20},\hat{N}_{01},\hat{N}_{11},\hat{N}_{21}$ such that $\deg(\hat{N}_{0j},\hat{N}_{2j})=4$ and $\deg(\hat{N}_{1j})=5$. It is noted that $\hat{N}_{00}$ represents the encrypted genotype/phenotype frequency count of the first homozygous genotype (0)/unaffected phenotype (0), $\hat{N}_{10}$ represents the encrypted genotype/phenotype frequency count of the heterozygous genotype (1)/unaffected phenotype (0), $\hat{N}_{20}$ represents the encrypted genotype/phenotype frequency count of the second homozygous genotype (2)/unaffected phenotype (0), $\hat{N}_{01}$ represents the encrypted genotype/phenotype frequency count of the first homozygous genotype (0)/affected phenotype (1), $\hat{N}_{11}$ represents the encrypted genotype/phenotype frequency count of the heterozygous genotype (1)/affected phenotype (1), and $\hat{N}_{21}$ represents the encrypted genotype/phenotype frequency count of the second homozygous genotype (2)/affected phenotype (1).

In procedural terms, referring to FIG. 6, the foregoing genotype/phenotype frequency computation involves first receiving a vector of ciphertexts, each ciphertext of which represents an encrypted genomic sample and takes the form of an indicator polynomial and its associated coefficients which is indicative of a particular pairing of the genotypes and phenotypes (process action 600). Then, an encryption of a count of the pairings of each genotype and each phenotype is computed as a measure of the frequency of that genotype/phenotype pair (process action 602). It is noted that this last action is performed without decrypting the genomic data.
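To make the counting step concrete, the following plaintext sketch evaluates indicator polynomials modulo an odd integer and sums their products over the samples, which is the arithmetic that would be applied to ciphertexts. Several things here are assumptions made for illustration: the modulus `Q` is a hypothetical stand-in for q; the phenotype indicators are read from equation (61) as $g_{0}(y)=(2^{-1})_{q}(y^{2}-y)$ and $g_{2}(y)=(2^{-1})_{q}(y^{2}+y)$; genotypes 0, 1, 2 are assumed to be encoded as −1, 0, 1; and $g_{1}(z)=1-z^{2}$ is an assumed indicator that ignores the missing-data symbol $x^{\tau}$.

```python
Q = 12289  # hypothetical odd modulus standing in for q
INV2 = pow(2, -1, Q)  # (2^-1)_q, the inverse of 2 modulo Q (Python 3.8+)


def g0(v):
    """Indicator of the encoded value -1."""
    return INV2 * (v * v - v) % Q


def g1(v):
    """Indicator of the encoded value 0 (assumed form; ignores x^tau)."""
    return (1 - v * v) % Q


def g2(v):
    """Indicator of the encoded value +1."""
    return INV2 * (v * v + v) % Q


GENOTYPE_INDICATORS = (g0, g1, g2)
PHENOTYPE_INDICATORS = (g0, g2)  # unaffected (-1), affected (+1)


def count_table(z, y):
    """Return table[i][j] = N_ij using only additions and multiplications."""
    table = [[0, 0] for _ in range(3)]
    for zk, yk in zip(z, y):
        for i, gi in enumerate(GENOTYPE_INDICATORS):
            for j, gj in enumerate(PHENOTYPE_INDICATORS):
                table[i][j] = (table[i][j] + gi(zk) * gj(yk)) % Q
    return table


if __name__ == "__main__":
    genotypes = [-1, 0, 1, 1, 0, -1, -1]
    phenotypes = [-1, 1, 1, -1, 0, 1, -1]  # 0 encodes a missing phenotype
    print(count_table(genotypes, phenotypes))
```

A sample whose phenotype is missing (encoded as 0) contributes to none of the six counts, because both phenotype indicators vanish at 0.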

1.2.2 Genomic Computations Using Encrypted Genotype and Genotype/Phenotype Frequencies

Once the encrypted genotype/phenotype frequencies are computed, various genomic computations can be performed using the encrypted data. For example, the aforementioned Cochran-Armitage test for trend computations can be performed, as will be described in more detail in the sections to follow. It is noted that the definitions described previously also apply to the following descriptions.

1.2.2.1 Cochran-Armitage Test for Trend (CATT)

The Cochran-Armitage test for trend is used for testing association between a candidate allele A and a disease in a case-control study. The input data is a 2×3 contingency table of 3 genotypes vs. case/controls, such as the exemplary table shown in FIG. 7.

The CATT computes the statistic:

$\begin{matrix}{T\overset{def}{=}{\sum\limits_{i = 1}^{3}\;{w_{i}\left( {{N_{1\; i}R_{2}} - {N_{2\; i}R_{1}}} \right)}}} & (63)\end{matrix}$

where $\overset{\rightarrow}{w}\overset{def}{=}\left( {w_{1},w_{2},w_{3}} \right)$ is a vector of pre-determined weights, and the difference $(N_{1i}R_{2}-N_{2i}R_{1})$ can be thought of as the difference $N_{1i}-N_{2i}$ after reweighing the rows to have the same sum.

The variance of this statistic can be computed as:

$\begin{matrix}{{Var}(T)=\frac{R_{1}R_{2}}{N}\left(\sum\limits_{i=1}^{3}w_{i}^{2}C_{i}\left(N-C_{i}\right)-2\sum\limits_{i=1}^{2}\sum\limits_{j=i+1}^{3}w_{i}w_{j}C_{i}C_{j}\right)} & (64)\end{matrix}$

The test statistic $X^{2}$ is then defined as follows and compared to a χ²-statistic with 1 degree of freedom:

$\begin{matrix}{X^{2}\overset{def}{=}\frac{T^{2}}{{Var}(T)}} & (65)\end{matrix}$

As in the Pearson Goodness-of-Fit (Chi-Squared) Test, the p-value p of X² according to the χ²-distribution with 1 degree of freedom is computed. It is then concluded that there is no association between the candidate allele and the disease if p > α, for significance level α. When α=0.05, this reduces to checking if X² < 3.84; and when α=0.01, this reduces to checking if X² < 6.64.
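The decision rule can be illustrated with a short Python sketch. It relies on the standard identity that, for one degree of freedom, the upper tail of the χ²-distribution equals $\operatorname{erfc}\left(\sqrt{x/2}\right)$; the function names are hypothetical and the step is performed by the user on decrypted values.

```python
import math


def chi2_1df_p_value(x2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x2 / 2.0))


def no_association(x2, significance=0.05):
    """True when the p-value exceeds the significance level."""
    return chi2_1df_p_value(x2) > significance


if __name__ == "__main__":
    for threshold in (3.84, 6.64):
        print(threshold, round(chi2_1df_p_value(threshold), 4))
```

Evaluating the p-value at the thresholds 3.84 and 6.64 gives values close to 0.05 and 0.01 respectively, which is why the comparison against those constants is equivalent to the p-value test.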

The weights $\overset{\rightarrow}{w}=(w_{1},w_{2},w_{3})$ are chosen as follows. The weights $\overset{\rightarrow}{w}=(0,1,2)$ are used for the additive (co-dominant) model, $\overset{\rightarrow}{w}=(0,1,1)$ for the dominant model (A is dominant over a), and $\overset{\rightarrow}{w}=(0,0,1)$ for the recessive model (A is recessive to allele a).

1.2.2.2 Cochran-Armitage Test for Trend Using Encrypted Genotype/Phenotypes Frequency Counts

In view of the foregoing, CATT can be performed using encrypted genotype/phenotype frequencies as follows. A vector of ciphertexts $(\hat{z}_{1},\ldots,\hat{z}_{K})$ encrypting genotype samples in $\{-1,0,1,x^{\tau}\}$, a vector of ciphertexts $(\hat{y}_{1},\ldots,\hat{y}_{K})$ encrypting phenotypes in $\{-1,0,1\}$, and a vector of (plaintext) weights $(w_{0},w_{1},w_{2})$, are input, along with the ciphertexts $\hat{N}_{00},\hat{N}_{10},\hat{N}_{20},\hat{N}_{01},\hat{N}_{11},\hat{N}_{21}$ representing the previously computed encrypted genotype/phenotype frequency counts. Ciphertexts $\hat{\alpha},\hat{\beta}$ are then computed as follows:

$\begin{matrix}{\hat{R}_{0}\leftarrow\hat{N}_{00}+\hat{N}_{10}+\hat{N}_{20},\quad\hat{R}_{1}\leftarrow\hat{N}_{01}+\hat{N}_{11}+\hat{N}_{21}\qquad\text{// }\deg(\hat{R}_{i})=3} & (66) \\ {\hat{C}_{0}\leftarrow\hat{N}_{00}+\hat{N}_{01},\quad\hat{C}_{1}\leftarrow\hat{N}_{10}+\hat{N}_{11},\quad\hat{C}_{2}\leftarrow\hat{N}_{20}+\hat{N}_{21}\qquad\text{// }\deg(\hat{C}_{i})=3} & (67) \\ {\hat{N}\leftarrow\hat{R}_{0}+\hat{R}_{1}\qquad\text{// }\deg(\hat{N})=3} & (68) \\ {\hat{T}\leftarrow\sum\limits_{i=0}^{2}w_{i}\left(\hat{N}_{i0}\hat{R}_{1}-\hat{N}_{i1}\hat{R}_{0}\right)\qquad\text{// }\deg(\hat{T})=6} & (69) \\ {\hat{\alpha}\leftarrow\hat{N}\cdot\hat{T}^{2}\qquad\text{// }\deg(\hat{\alpha})=15} & (70) \\ {\hat{\beta}\leftarrow\hat{R}_{0}\hat{R}_{1}\left(\sum\limits_{i=0}^{2}w_{i}^{2}\hat{C}_{i}\left(\hat{N}-\hat{C}_{i}\right)-2\sum\limits_{i=0}^{1}\sum\limits_{j=i+1}^{2}w_{i}w_{j}\hat{C}_{i}\hat{C}_{j}\right)\qquad\text{// }\deg(\hat{\beta})=12} & (71)\end{matrix}$

The ciphertexts $\hat{\alpha},\hat{\beta}$ are then output (such as to the user as described previously). Once decrypted and decoded, the aforementioned test statistic $X^{2}$ can be computed from $\alpha,\beta$ such that $X^{2}=\alpha/\beta$.
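The CATT computation of equations (66) to (71) can be sketched on plaintext counts as follows. The sketch is illustrative only: the function name and the example table are hypothetical, and no encryption is involved. It shows that the numerator and denominator are built from additions and multiplications alone, with the single division $X^{2}=\alpha/\beta$ left to the user.

```python
from fractions import Fraction


def catt_components(table, weights):
    """Evaluate (66)-(71); table[i][j] = N_ij, genotype i, phenotype j."""
    # Phenotype totals R_j and genotype totals C_i.
    r = [sum(table[i][j] for i in range(3)) for j in range(2)]
    c = [table[i][0] + table[i][1] for i in range(3)]
    n = r[0] + r[1]
    t = sum(w * (table[i][0] * r[1] - table[i][1] * r[0])
            for i, w in enumerate(weights))
    alpha = n * t * t
    spread = sum(w * w * ci * (n - ci) for w, ci in zip(weights, c))
    cross = sum(weights[i] * weights[j] * c[i] * c[j]
                for i in range(2) for j in range(i + 1, 3))
    beta = r[0] * r[1] * (spread - 2 * cross)
    return alpha, beta


if __name__ == "__main__":
    example = [[20, 10], [30, 35], [10, 25]]  # hypothetical counts
    alpha, beta = catt_components(example, (0, 1, 2))  # additive model
    print(alpha, beta, float(Fraction(alpha, beta)))
```

For the hypothetical table above, the resulting statistic is approximately 9.4, which exceeds both 3.84 and 6.64.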

2.0 Exemplary Operating Environments

The genomic data encryption embodiments described herein are operational within numerous types of general purpose or special purpose computing system environments or configurations, as indicated previously. FIG. 8 illustrates a simplified example of a general-purpose computer system on which various embodiments and elements of genomic data encryption, as described herein, may be implemented. It is noted that any boxes that are represented by broken or dashed lines in the simplified computing device 10 shown in FIG. 8 represent alternate embodiments of the simplified computing device. As described below, any or all of these alternate embodiments may be used in combination with other alternate embodiments that are described throughout this document. The simplified computing device 10 is typically found in devices having at least some minimum computational capability such as personal computers (PCs), server computers, handheld computing devices, laptop or mobile computers, communications devices such as cell phones and personal digital assistants (PDAs), multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and audio or video media players.

To allow a device to implement the genomic data encryption embodiments described herein, the device should have a sufficient computational capability and system memory to enable basic computational operations. In particular, the computational capability of the simplified computing device 10 shown in FIG. 8 is generally illustrated by one or more processing unit(s) 12, and may also include one or more graphics processing units (GPUs) 14, either or both in communication with system memory 16. Note that the processing unit(s) 12 of the simplified computing device 10 may be specialized microprocessors (such as a digital signal processor (DSP), a very long instruction word (VLIW) processor, a field-programmable gate array (FPGA), or other micro-controller) or can be conventional central processing units (CPUs) having one or more processing cores.

In addition, the simplified computing device 10 shown in FIG. 8 may also include other components such as a communications interface 18. The simplified computing device 10 may also include one or more conventional computer input devices 20 (e.g., pointing devices, keyboards, audio (e.g., voice) input devices, video input devices, haptic input devices, gesture recognition devices, devices for receiving wired or wireless data transmissions, and the like). The simplified computing device 10 may also include other optional components such as one or more conventional computer output devices 22 (e.g., display device(s) 24, audio output devices, video output devices, devices for transmitting wired or wireless data transmissions, and the like). Note that typical communications interfaces 18, input devices 20, output devices 22, and storage devices 26 for general-purpose computers are well known to those skilled in the art, and will not be described in detail herein.

The simplified computing device 10 shown in FIG. 8 may also include a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 10 via storage devices 26, and can include both volatile and nonvolatile media that is either removable 28 and/or non-removable 30, for storage of information such as computer-readable or computer-executable instructions, data structures, program modules, or other data. Computer-readable media includes computer storage media and communication media. Computer storage media refers to tangible computer-readable or machine-readable media or storage devices such as digital versatile disks (DVDs), compact discs (CDs), floppy disks, tape drives, hard drives, optical drives, solid state memory devices, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, magnetic cassettes, magnetic tapes, magnetic disk storage, or other magnetic storage devices.

Retention of information such as computer-readable or computer-executable instructions, data structures, program modules, and the like, can also be accomplished by using any of a variety of the aforementioned communication media (as opposed to computer storage media) to encode one or more modulated data signals or carrier waves, or other transport mechanisms or communications protocols, and can include any wired or wireless information delivery mechanism. Note that the terms “modulated data signal” or “carrier wave” generally refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For example, communication media can include wired media such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media such as acoustic, radio frequency (RF), infrared, laser, and other wireless media for transmitting and/or receiving one or more modulated data signals or carrier waves.

Furthermore, software, programs, and/or computer program products embodying some or all of the various genomic data encryption embodiments described herein, or portions thereof, may be stored, received, transmitted, or read from any desired combination of computer-readable or machine-readable media or storage devices and communication media in the form of computer-executable instructions or other data structures.

Finally, the genomic data encryption embodiments described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, and the like, that perform particular tasks or implement particular abstract data types. The genomic data encryption embodiments may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communications networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media including media storage devices. Additionally, the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.

3.0 Other Embodiments

The foregoing description of the various genomic data encryption embodiments involved encoding the genomic data before encrypting it. The following describes another encoding embodiment that is designed to encode real numbers, and which can be employed for encoding genomic data.

For a real number α∈ℝ,

${{decomp}(\alpha)}\overset{def}{=}\left( {\alpha_{k_{\alpha}},\ldots,\alpha_{0},\alpha_{- 1},\ldots} \right)$ is defined to be the (possibly infinite) binary decomposition of α, so that:

$\begin{matrix}{\alpha = {\sum\limits_{i = {- \infty}}^{k_{\alpha}}\;{\alpha_{i}{2^{i}.}}}} & (72)\end{matrix}$

For a sequence

${\overset{\rightarrow}{\alpha}\overset{def}{=}\left( {\alpha_{k},\ldots,\alpha_{0},\alpha_{- 1},\ldots,\alpha_{- s}} \right)},{{{real}\left( \overset{\rightarrow}{\alpha} \right)}\overset{def}{=}{\sum\limits_{i = {- s}}^{k}\;{\alpha_{i}{2^{i}.}}}}$

The ring

$R\overset{def}{=}{{Z\lbrack x\rbrack}/\left\langle {x^{n} + 1} \right\rangle}$ is employed and λ is used to denote its expansion factor. For this specific ring, λ=n.

Now let α∈ℝ be a real number that it is desired to encode as a polynomial in R=Z[x]/⟨x^(n)+1⟩, and let u′ be the (decimal) precision it is desired to maintain. Let

${F(z)}\overset{def}{=}{\sum\limits_{i = 0}^{D}\;{a_{i}z^{i}}}$ be a degree-D polynomial that can be used in performing computations on the encoded (and encrypted using an appropriate homomorphic encryption scheme) real number data, i.e., it is desired to compute F(α) using decimal precision u′.

Let decomp(α)=(α_(k), …, α₀, α₋₁, …). Since it is desired to maintain decimal precision u′, binary precision

$u\overset{def}{=}{{4\; u^{\prime}} > {{\log_{2}(10)}u^{\prime}}}$ will be maintained. Therefore only the truncated decomposition

$\overset{\rightarrow}{\alpha}\overset{def}{=}{\left( {\alpha_{k},\ldots,\alpha_{0},\alpha_{- 1},\ldots,\alpha_{- u}} \right) \in \left\{ {0,1} \right\}^{k + u + 1}}$ is considered, so that

$\overset{\sim}{\alpha}\overset{def}{=}{{real}\left( \overset{\rightarrow}{\alpha} \right)}$ approximates α up to decimal precision u′. Computing

$F\left( \overset{\sim}{\alpha} \right)$ is the only concern.

For i=0, …, k+u,

$\beta_{i}\overset{def}{=}{\alpha_{i - u} \in \left\{ {0,1} \right\}}$ is defined and the encoding of α in R is defined as the polynomial:

$\begin{matrix}{{e_{\alpha}(x)}\overset{def}{=}{{\sum\limits_{i = 0}^{k + u}{\beta_{i}x^{i}}} \in {R_{2}.}}} & (73)\end{matrix}$

The idea is then to perform the computation on e_(α) and decode by evaluating the resulting polynomial at x=2. However, note that e_(α)(2)=2^(u)·$\overset{\sim}{\alpha}$. Therefore,

$\begin{matrix}{{F\left( \overset{\sim}{\alpha} \right)} = {{F\left( \frac{e_{\alpha}(2)}{2^{u}} \right)} = {\sum\limits_{i = 0}^{D}{\left( \frac{a_{i}}{2^{i \cdot u}} \right){{e_{\alpha}(2)}^{i}.}}}}} & (74)\end{matrix}$

Multiplying by 2^(Du) results in:

$\begin{matrix}\begin{matrix}{{2^{Du} \cdot {F\left( \overset{\sim}{\alpha} \right)}} = {\sum\limits_{i = 0}^{D}{a_{i} \cdot 2^{{({D - i})}u} \cdot {e_{\alpha}(2)}^{i}}}} \\{= {\sum\limits_{i = 0}^{D}{{a_{i} \cdot x^{{({D - i})}u} \cdot {e_{\alpha}(x)}^{i}}\left( {{mod}\left( {x - 2} \right)} \right)}}}\end{matrix} & (75)\end{matrix}$ where the last equality holds as long as there is no reduction modulo x^(n)+1, that is, as long as Du+D(k+u)=D(2u+⌊log α⌋)<n.

Now define

${G(z)}\overset{def}{=}{\sum\limits_{i = 0}^{D}{a_{i} \cdot x^{{({D - i})}u} \cdot {z^{i}.}}}$ Then,

$\begin{matrix}{{F\left( \overset{\sim}{\alpha} \right)} = {\frac{{G\left( e_{\alpha} \right)}\left( {{mod}\left( {x - 2} \right)} \right)}{2^{Du}}.}} & (76)\end{matrix}$ This means that to compute F($\overset{\sim}{\alpha}$), simply compute G(e_(α)), evaluate the resulting polynomial at x=2, and divide by 2^(Du). The function G is a transformation of the original function F, and is designed to be used in performing computations on the encoded encrypted real number data. Thus, the encrypted data will be plugged into that function G to carry out the actual computation on encrypted data. The reason for transforming the original function F into G is that the coefficients of the encrypted intermediate results stay smaller. This in turn allows for better parameters and more efficient schemes.
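The relationship in Eq. (76) can be checked on unencrypted data with a short sketch. The following Python fragment is illustrative only and is not part of the described embodiments: the helper names (`poly_mul`, `eval_G`, `eval_at_two`), the integer coefficients a_(i), and the sample values F(z)=1+3z+z², u=2, n=16 are all assumptions chosen for the example, and no encryption is performed.

```python
from fractions import Fraction


def poly_mul(p, q, n):
    """Multiply two length-n coefficient lists in Z[x]/(x^n + 1)."""
    out = [0] * n
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if i + j < n:
                out[i + j] += a * b
            else:
                out[i + j - n] -= a * b  # wrap-around uses x^n = -1
    return out


def eval_G(a, e, u, n):
    """Return G(e) = sum_i a[i] * x^((D-i)u) * e^i as n coefficients."""
    D = len(a) - 1
    e = list(e) + [0] * (n - len(e))
    result = [0] * n
    power = [1] + [0] * (n - 1)          # e^0
    for i, a_i in enumerate(a):
        shift = [0] * n
        shift[(D - i) * u] = a_i         # monomial a_i * x^((D-i)u)
        term = poly_mul(shift, power, n)
        result = [r + t for r, t in zip(result, term)]
        power = poly_mul(power, e, n)
    return result


def eval_at_two(coeffs):
    """Evaluate a coefficient list at x = 2."""
    return sum(c * 2**j for j, c in enumerate(coeffs))


# Assumed example: F(z) = 1 + 3z + z^2, so a = [1, 3, 1] and D = 2.
a, u, n = [1, 3, 1], 2, 16
e_alpha = [1, 1, 0, 1]                   # 1 + x + x^3 encodes 2.75 with u = 2
g = eval_G(a, e_alpha, u, n)
D = len(a) - 1
f_tilde = Fraction(eval_at_two(g), 2 ** (D * u))
print(f_tilde, float(f_tilde))           # 269/16 16.8125
```

Here D(2u+k)=2(4+1)=10<16, so no reduction modulo x^(n)+1 occurs and the result agrees with evaluating F directly at 2.75.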

In procedural terms, referring to FIG. 9, the foregoing homomorphic polynomial encryption scheme for encrypting real numbers involves the encoding and encrypting entity (e.g., the aforementioned user as represented by the user computer) generating a bit decomposition of the real number to be encoded (process action 900). This corresponds to the previously introduced equation

${{real}\left( \overset{\rightarrow}{\alpha} \right)}\overset{def}{=}{\sum\limits_{i = {- s}}^{k}{\alpha_{i}2^{i}}}$ for the bit decomposition sequence of the real number

$\overset{\rightarrow}{\alpha}\overset{def}{=}{\left( {\alpha_{k},\ldots,\alpha_{0},\alpha_{- 1},\ldots,\alpha_{- s}} \right).}$ The bit decomposition is then converted to a truncated bit decomposition based on the desired precision (process action 902). Namely, the desired decimal precision u′ which corresponds to the binary precision u, as described previously. Thus, as described previously, the truncated bit decomposition

$\overset{\rightarrow}{\alpha}\overset{def}{=}\left( {\alpha_{k},\ldots,\alpha_{0},\alpha_{- 1},\ldots,\alpha_{- u}} \right)$ results. The truncated bit decomposition of the real number is then encoded using the polynomial

${{e_{\alpha}(x)} = {{\sum\limits_{i = 0}^{k + u}{\beta_{i}x^{i}}} \in R_{2}}},$ where β_(i)=α_(i−u) and k+u is the total number of bits in the truncated bit decomposition (process action 904). Then, in process action 906, the encoded real number is encrypted using an appropriate homomorphic encryption scheme. The encoded and encrypted real number data can then be provided to the entity or entities that perform storage and computations on the data (e.g., the previously described cloud-based entity or entities).
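Process actions 900 through 904 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the described implementation: the function names `binary_precision` and `encode_real` are hypothetical, only non-negative inputs are handled (the description does not address sign), and the encryption of process action 906 is omitted.

```python
from fractions import Fraction


def binary_precision(u_prime):
    """Binary precision u = 4u' for a desired decimal precision u'."""
    return 4 * u_prime


def encode_real(alpha, u):
    """Return the coefficients of e_alpha(x), lowest degree first.

    Coefficient i is beta_i = alpha_(i-u), the bit of weight 2^(i-u)
    in the truncated bit decomposition of alpha.
    """
    if alpha < 0:
        raise ValueError("this sketch handles non-negative inputs only")
    # floor(alpha * 2^u) keeps the bits alpha_k ... alpha_(-u) and drops the rest
    scaled = int(Fraction(alpha) * 2**u)
    bits = []
    while scaled:
        bits.append(scaled & 1)
        scaled >>= 1
    return bits or [0]


# 2.75 is 10.11 in binary; with u = 2 the encoding is 1 + x + x^3
coeffs = encode_real(2.75, 2)
print(coeffs)                                       # [1, 1, 0, 1]
print(sum(b << i for i, b in enumerate(coeffs)))    # 11, i.e. 2^2 * 2.75
```

Using `Fraction` keeps the scaling by 2^(u) exact, so the truncation depends only on the input value and not on floating-point rounding of the product.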

Referring now to FIG. 10, in process action 1000, the storage and computations entity or entities receive the encoded and encrypted real number, along with other similarly encoded and encrypted real numbers which form a data set (such as the previously described genomic data). The encoded and encrypted data set is then computed on using the equation G(e_(α))=Σ_(i=0)^(D)a_(i)·x^((D−i)u)·e_(α)^(i), which is known along with the coefficients a_(i) to both the encoding and encrypting entity and the storage and computations entity or entities (process action 1002). It is noted that D is the degree of the polynomial function G (and F). It is also noted that the desired computations (e.g., one or more of the genomic computations described previously) are performed without decrypting the data. In process action 1004 the encrypted results are sent to the end user (e.g., the encrypting entity).

Referring now to FIG. 11, the end user receives the encrypted results (process action 1100), which are in the form of ciphertexts. The encrypted results are first decrypted using the appropriate homomorphic decryption scheme to recover the plaintext polynomials (process action 1102). Then, in process action 1104, the decrypted plaintext polynomials are transformed using Eq. (76) i.e.,

${{F\left( \overset{\sim}{\alpha} \right)} = \frac{{G\left( e_{\alpha} \right)}\left( {{mod}\left( {x - 2} \right)} \right)}{2^{Du}}},$ and evaluated at x=2 to obtain $\overset{\sim}{\alpha}$, which represents the truncated real number representing the results.
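The decoding of process action 1104 amounts to one polynomial evaluation and one division. The sketch below assumes the plaintext polynomial has already been recovered by decryption and is available as a list of integer coefficients; the function name `decode_result` and the sample coefficients are hypothetical and serve only to illustrate the step.

```python
from fractions import Fraction


def decode_result(coeffs, D, u):
    """Evaluate a decrypted plaintext polynomial at x = 2, then divide by 2^(D*u).

    coeffs lists the integer coefficients, lowest degree first.
    """
    value = 0
    for c in reversed(coeffs):      # Horner's rule at x = 2
        value = value * 2 + c
    return Fraction(value, 2 ** (D * u))


# Assumed plaintext for G(e) with F(z) = 1 + 3z + z^2, e = 1 + x + x^3, u = 2
plain = [1, 2, 4, 5, 3, 3, 1]
result = decode_result(plain, D=2, u=2)
print(result, float(result))        # 269/16 16.8125
```

Returning a `Fraction` avoids losing low-order bits when D·u is large; a caller may convert to a float once the desired precision is known.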

It is also noted that any or all of the aforementioned embodiments throughout the description may be used in any combination desired to form additional hybrid embodiments. In addition, although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Wherefore, what is claimed is:
 1. A computer-implemented process for encrypting genomic data, comprising: using a hardware processor of a computer to perform the following process actions: receiving genomic data comprising genotypes, said genotypes consisting of a heterozygous genotype, a first homozygous genotype, a second homozygous genotype, and an unknown genotype in the case where the actual genotype is unknown; encoding the genomic data as polynomials in a message space of a homomorphic encryption scheme, wherein, each of the first homozygous genotypes is represented in the polynomials by an integer −1, each of the heterozygous genotypes is represented in the polynomials by an integer 0, each of the second homozygous genotypes is represented in the polynomials by an integer 1, and each unknown genotype is represented in the polynomials by a function that when encrypted comprises a non-zero polynomial, but which when decrypted produces a zero; and encrypting the encoded genomic data using the homomorphic polynomial encryption scheme to produce a vector of ciphertexts, each ciphertext of which represents a different sample of the genomic data and takes the form of a polynomial and its associated coefficients, wherein the encrypted genomic data can be used in genomic computations without having to be decrypted.
 2. The process of claim 1, further comprising an action of transmitting the encoded and encrypted genomic data via a computer network for storage and genomic computations.
 3. The process of claim 2, further comprising the actions of: receiving results of the genomic computations on the encoded and encrypted genomic data, said genomic computations having been performed without decoding and decrypting the data and said results exhibiting the same encoding and encryption as the genomic data; decrypting the received results using a homomorphic polynomial decryption scheme applicable to the homomorphic polynomial encryption scheme used to encrypt the encoded genomic data; and decoding the decrypted results using a decoding scheme applicable to the encoding scheme used to encode the genomic data.
 4. The process of claim 1, wherein the process action of encrypting the encoded genomic data using a homomorphic polynomial encryption scheme to produce a vector of ciphertexts, each ciphertext of which represents a different sample of the genomic data and takes the form of a polynomial and its associated coefficients, comprises an action of forming the polynomial as an indicator polynomial which is indicative of the genotype of the sample.
 5. The process of claim 1, wherein the genomic data further comprises phenotypes, said phenotypes consisting of an unaffected phenotype, an affected phenotype, and an unknown phenotype in the case where the actual phenotype is unknown, and wherein the process action of encoding the genomic data, comprises the actions of: representing each of the unaffected phenotypes in the polynomials in the message space of the homomorphic encryption scheme by an integer −1; representing each of the affected phenotypes in the polynomials in the message space of the homomorphic encryption scheme by an integer 1; and representing each unknown phenotype in the polynomials in the message space of the homomorphic encryption scheme by an integer 0.
 6. The process of claim 5, wherein the process action of encrypting the encoded genomic data using a homomorphic polynomial encryption scheme to produce a vector of ciphertexts, each ciphertext of which represents a different sample of the genomic data and takes the form of a polynomial and its associated coefficient, comprises an action of forming the polynomial as an indicator polynomial which is indicative of the pairing of the genotype and phenotype of the sample.
 7. The process of claim 1, wherein the process action of encrypting the encoded genomic data using a homomorphic polynomial encryption scheme comprises encrypting the encoded genomic data using a somewhat homomorphic encryption (SwHE) scheme.
 8. A system for performing genomic computations on encrypted genomic data, comprising: one or more computing devices, wherein said computing devices are in communication with each other via a computer network whenever there are multiple computing devices; and a computer program having program modules executable by one or more hardware processors of the one or more computing devices, the one or more computing devices being directed by the program modules of the computer program to, receive the encrypted genomic data, said genomic data having been encrypted using a homomorphic polynomial encryption scheme to produce one or more vectors of ciphertexts, each ciphertext of each vector represents a different sample of the genomic data and takes the form of a polynomial and its associated coefficients, and wherein the genomic data comprising genotypes, said genotypes consisting of a heterozygous genotype, a first homozygous genotype, a second homozygous genotype, and an unknown genotype in the case where the actual genotype is unknown, and perform one or more genomic computations on the vector or vectors of ciphertexts of the received said encrypted genomic data without decrypting the genomic data, said genomic computations comprising computing an encryption of a count of each genotype present in each received vector of ciphertexts as a measure of a frequency of that genotype, said count of each genotype being computed without decrypting the genomic data.
 9. The system of claim 8, further comprising a program module for transmitting the results of the genomic computations to an end user, said results of the genomic computations exhibiting the same encryption exhibited by the genomic data.
 10. The system of claim 9, wherein the end user is the same entity that encrypted the genomic data.
 11. The system of claim 9, wherein the end user is a different entity than the entity that encrypted the genomic data, but has been authorized by the encrypting entity to decrypt the results.
 12. The system of claim 8, wherein the program module for performing one or more genomic computations on the vector or vectors of ciphertexts received without decrypting the genomic data, further comprises a sub-module for performing a Pearson goodness-of-fit (chi-squared) test to measure data quality using the counts of each genotype.
 13. A computer-implemented process for performing computations on data comprising: using a hardware processor of a computer to perform the following process actions on the data: receiving the data comprising encoded and encrypted real numbers, wherein each encoded and encrypted real number was encoded by generating a bit decomposition of the real number, converting the bit decomposition to a truncated bit decomposition $\overset{\rightarrow}{\alpha}$=(α_(k), …, α₀, α₋₁, …, α_(−u)) based on the desired precision u, encoding the real number using the polynomial ${{e_{\alpha}(x)}\overset{def}{=}{{\sum\limits_{i = 0}^{k + u}{\beta_{i}x^{i}}} \in R_{2}}},$ where β_(i)=α_(i−u) and k+u is the total number of bits in the truncated bit decomposition and encrypting the encoded real number using a homomorphic encryption scheme; performing computations on the encoded and encrypted real numbers without decryption to produce encrypted results using an equation in the form of G(e_(α))=Σ_(i=0)^(D)a_(i)·x^((D−i)u)·e_(α)^(i), where D is the degree of the polynomial and the a_(i)'s are prescribed coefficients; transmitting the encrypted results, wherein each encoded and encrypted real number in the encrypted results can be subsequently decrypted using a homomorphic decryption scheme, and decoded by transforming the decrypted results using ${{F\left( \overset{\sim}{\alpha} \right)} = \frac{{G\left( e_{\alpha} \right)}\left( {{mod}\left( {x - 2} \right)} \right)}{2^{Du}}},$ and evaluated at x=2 to obtain $\overset{\sim}{\alpha}$, which represents the truncated real number representing the decoded results.
 14. A system for performing genomic computations on encrypted genomic data, comprising: one or more computing devices, wherein said computing devices are in communication with each other via a computer network whenever there are multiple computing devices; and a computer program having program modules executable by one or more hardware processors of the one or more computing devices, the one or more computing devices being directed by the program modules of the computer program to, receive the encrypted genomic data in the form of two vectors of ciphertexts, each vector representing encrypted genotypes from a different locus and each ciphertext of each vector representing an encrypted genotype sample from the locus associated with that vector, wherein the genomic data has been encrypted using a homomorphic polynomial encryption scheme to produce one or more vectors of ciphertexts, each ciphertext of which represents a different sample of the genomic data and takes the form of a polynomial and its associated coefficients, and wherein the genomic data comprising genotypes, said genotypes consisting of a heterozygous genotype, a first homozygous genotype, a second homozygous genotype, and an unknown genotype in the case where the actual genotype is unknown, and perform one or more genomic computations on the vectors of ciphertexts of the received said encrypted genomic data without decrypting the genomic data, said genomic computations comprising computing an encryption of a count of the pairings of each genotype from one of the vectors with each genotype from the other vector as a measure of a frequency of that genotype pair, said count of each genotype pairing being computed without decrypting the genomic data.
 15. The system of claim 14, wherein the program module for performing one or more genomic computations on the vectors of ciphertexts received without decrypting the genomic data, further comprises a sub-module for performing an estimation maximization for haplotyping to estimate haplotype frequencies from the genotype pairing counts.
 16. The system of claim 14, wherein the program module for performing one or more genomic computations on the vectors of ciphertexts received without decrypting the genomic data, further comprises a sub-module for performing a linkage disequilibrium measurement to estimate correlation between genes from the genotype pairing counts.
 17. A system for performing genomic computations on encrypted genomic data, comprising: one or more computing devices, wherein said computing devices are in communication with each other via a computer network whenever there are multiple computing devices; and a computer program having program modules executable by one or more hardware processors of the one or more computing devices, the one or more computing devices being directed by the program modules of the computer program to, receive the encrypted genomic data in the form of a vector of ciphertexts, each ciphertext of which represents an encrypted genomic sample and takes the form of an indicator polynomial and its associated coefficients which is indicative of the pairing of the genotype and phenotype of the sample, said genomic data having been encrypted using a homomorphic polynomial encryption scheme to produce said vector of ciphertexts, and wherein the genomic data comprising genotypes, said genotypes consisting of a heterozygous genotype, a first homozygous genotype, a second homozygous genotype, and an unknown genotype in the case where the actual genotype is unknown, and wherein the genomic data further comprises phenotypes, said phenotypes consisting of an unaffected phenotype, an affected phenotype, and an unknown phenotype in the case where the actual phenotype is unknown, and perform one or more genomic computations on the vector of ciphertexts of the received said encrypted genomic data without decrypting the genomic data, said genomic computations comprising computing an encryption of a count of the pairings of each genotype and each phenotype as a measure of a frequency of that genotype/phenotype pair, said count of each genotype/phenotype pairing being computed without decrypting the genomic data.
 18. The system of claim 17, wherein the program module for performing one or more genomic computations on the vector of ciphertexts received without decrypting the genomic data, further comprises a sub-module for performing a Cochran-Armitage test for trends on the correlations between genotypes and phenotypes from the genotype/phenotype pairing counts.